Search Results: "bernat"

19 September 2020

Vincent Bernat: Syncing NetBox with a custom Ansible module

The netbox.netbox collection from Ansible Galaxy provides several modules to update NetBox objects:

- name: create a device in NetBox
  netbox_device:
    netbox_url: http://netbox.local
    netbox_token: s3cret
    data:
      name: to3-p14.sfo1.example.com
      device_type: QFX5110-48S
      device_role: Compute Switch
      site: SFO1

However, if NetBox is not your source of truth, you may want to ensure it stays in sync with your configuration management database¹ by removing outdated devices or IP addresses. While it should be possible to glue together a playbook with a query, a loop and some filtering to delete unwanted elements, it feels clunky, inefficient and an abuse of YAML as a programming language. A specific Ansible module solves this issue and is likely more flexible.

Notice I recommend that you read Writing a custom Ansible module as an introduction, as well as Syncing MySQL tables for a first simpler example.

Code The module has the following signature and it syncs NetBox with the content of the provided YAML file:

netbox_sync:
  source: netbox.yaml
  api: https://netbox.example.com
  token: s3cret

The synchronized objects are:

sites,
manufacturers,
device types,
device roles,
devices, and
IP addresses.

In our environment, the YAML file is generated from our configuration management database and contains a set of devices and a list of IP addresses:

devices:
  ad2-p6.sfo1.example.com:
     datacenter: sfo1
     manufacturer: Cisco
     model: Catalyst 2960G-48TC-L
     role: net_tor_oob_switch
  to1-p6.sfo1.example.com:
     datacenter: sfo1
     manufacturer: Juniper
     model: QFX5110-48S
     role: net_tor_gpu_switch
# [ ]
ips:
  - device: ad2-p6.example.com
    ip: 172.31.115.18/21
    interface: oob
  - device: to1-p6.example.com
    ip: 172.31.115.33/21
    interface: oob
  - device: to1-p6.example.com
    ip: 172.31.254.33/32
    interface: lo0.0
# [ ]

The network team is not the sole tenant in NetBox. While adding new objects or modifying existing ones should be relatively safe, deleting unwanted objects can be risky. The module only deletes objects it did create or modify. To identify them, it marks them with a specific tag, cmdb. Most objects in NetBox accept tags.

Module definition Starting from the skeleton described in the previous article, we define the module:

module_args = dict(
    source=dict(type='path', required=True),
    api=dict(type='str', required=True),
    token=dict(type='str', required=True, no_log=True),
    max_workers=dict(type='int', required=False, default=10)
)
result = dict(
    changed=False
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)

It contains an additional optional arguments defining the number of workers to talk to NetBox and query the existing objects in parallel to speedup the execution.

Abstracting synchronization We need to synchronize different object types, but once we have a list of objects we want in NetBox, the grunt work is always the same:

check if the objects already exist,

retrieve them and put them in a form suitable for comparison,

retrieve the extra objects we don t want anymore,

compare the two sets, and

add missing objects, update existing ones, delete extra ones.

We code these behaviours into a `Synchronizer` abstract class. For each kind of object, a concrete class is built with the appropriate class attributes to tune its behaviour and a `wanted()` method to provide the objects we want. I am not explaining the abstract class code here. Have a look at the source if you want.

Synchronizing tags and tenants As a starter, here is how we define the class synchronizing the tags:

class SyncTags(Synchronizer):
    app = "extras"
    table = "tags"
    key = "name"
    def wanted(self):
        return  "cmdb": dict(
            slug="cmdb",
            color="8bc34a",
            description="synced by network CMDB")

The app and table attributes defines the NetBox objects we want to manipulate. The key attribute is used to determine how to lookup for existing objects. In this example, we want to lookup tags using their names. The wanted() method is expected to return a dictionary mapping object keys to the list of wanted attributes. Here, the keys are tag names and we create only one tag, cmdb, with the provided slug, color and description. This is the tag we will use to mark the objects we create or modify. If the tag does not exist, it is created. If it exists, the provided attributes are updated. Other attributes are left untouched. We also want to create a specific tenant for objects accepting such an attribute (devices and IP addresses):

class SyncTenants(Synchronizer):
    app = "tenancy"
    table = "tenants"
    key = "name"
    def wanted(self):
        return  "Network": dict(slug="network",
                                description="Network team")

Synchronizing sites We also need to synchronize the list of sites. This time, the wanted() method uses the information provided in the YAML file: it walks the devices and builds a set of datacenter names.

class SyncSites(Synchronizer):
    app = "dcim"
    table = "sites"
    key = "name"
    only_on_create = ("status", "slug")
    def wanted(self):
        result = set(details["datacenter"]
                     for details in self.source['devices'].values()
                     if "datacenter" in details)
        return  k: dict(slug=k,
                        status="planned")
                for k in result

Thanks to the use of the only_on_create attribute, the specified attributes are not updated if they are different. The goal of this synchronizer is mostly to collect the references to the different sites for other objects.

>>> pprint(SyncSites(**sync_args).wanted())
 'sfo1':  'slug': 'sfo1', 'status': 'planned' ,
 'chi1':  'slug': 'chi1', 'status': 'planned' ,
 'nyc1':  'slug': 'nyc1', 'status': 'planned'

Synchronizing manufacturers, device types and device roles The synchronization of manufacturers is pretty similar, except we do not use the only_on_create attribute:

class SyncManufacturers(Synchronizer):
    app = "dcim"
    table = "manufacturers"
    key = "name"
    def wanted(self):
        result = set(details["manufacturer"]
                     for details in self.source['devices'].values()
                     if "manufacturer" in details)
        return  k:  "slug": slugify(k) 
                for k in result

Regarding the device types, we use the foreign attribute linking a NetBox attribute to the synchronizer handling it.

class SyncDeviceTypes(Synchronizer):
    app = "dcim"
    table = "device_types"
    key = "model"
    foreign =  "manufacturer": SyncManufacturers 
    def wanted(self):
        result = set((details["manufacturer"], details["model"])
                     for details in self.source['devices'].values()
                     if "model" in details)
        return  k[1]: dict(manufacturer=k[0],
                           slug=slugify(k[1]))
                for k in result

The wanted() method refers to the manufacturer using its key attribute. In this case, this is the manufacturer name.

>>> pprint(SyncManufacturers(**sync_args).wanted())
 'Cisco':  'slug': 'cisco' ,
 'Dell':  'slug': 'dell' ,
 'Juniper':  'slug': 'juniper' 
>>> pprint(SyncDeviceTypes(**sync_args).wanted())
 'ASR 9001':  'manufacturer': 'Cisco', 'slug': 'asr-9001' ,
 'Catalyst 2960G-48TC-L':  'manufacturer': 'Cisco',
                           'slug': 'catalyst-2960g-48tc-l' ,
 'MX10003':  'manufacturer': 'Juniper', 'slug': 'mx10003' ,
 'QFX10002-36Q':  'manufacturer': 'Juniper', 'slug': 'qfx10002-36q' ,
 'QFX10002-72Q':  'manufacturer': 'Juniper', 'slug': 'qfx10002-72q' ,
 'QFX5110-32Q':  'manufacturer': 'Juniper', 'slug': 'qfx5110-32q' ,
 'QFX5110-48S':  'manufacturer': 'Juniper', 'slug': 'qfx5110-48s' ,
 'QFX5200-32C':  'manufacturer': 'Juniper', 'slug': 'qfx5200-32c' ,
 'S4048-ON':  'manufacturer': 'Dell', 'slug': 's4048-on' ,
 'S6010-ON':  'manufacturer': 'Dell', 'slug': 's6010-on'

The device roles are defined like this:

class SyncDeviceRoles(Synchronizer):
    app = "dcim"
    table = "device_roles"
    key = "name"
    def wanted(self):
        result = set(details["role"]
                     for details in self.source['devices'].values()
                     if "role" in details)
        return  k: dict(slug=slugify(k),
                        color="8bc34a")
                for k in result

Synchronizing devices A device is mostly a name with references to a role, a model, a datacenter and a tenant. These references are declared as foreign keys using the synchronizers defined previously.

class SyncDevices(Synchronizer):
    app = "dcim"
    table = "devices"
    key = "name"
    foreign =  "device_role": SyncDeviceRoles,
               "device_type": SyncDeviceTypes,
               "site": SyncSites,
               "tenant": SyncTenants 
    remove_unused = 10
    def wanted(self):
        return  name: dict(device_role=details["role"],
                           device_type=details["model"],
                           site=details["datacenter"],
                           tenant="Network")
                for name, details in self.source['devices'].items()
                if  "datacenter", "model", "role"  <= set(details.keys())

The remove_unused attribute is a safety implemented to fail if we have to delete more than 10 devices: this may be the indication there is a bug somewhere, unless one of your datacenter suddenly caught fire.

>>> pprint(SyncDevices(**sync_args).wanted())
 'ad2-p6.sfo1.example.com':  'device_role': 'net_tor_oob_switch',
                             'device_type': 'Catalyst 2960G-48TC-L',
                             'site': 'sfo1',
                             'tenant': 'Network' ,
 'to1-p6.sfo1.example.com':  'device_role': 'net_tor_gpu_switch',
                             'device_type': 'QFX5110-48S',
                             'site': 'sfo1',
                             'tenant': 'Network' ,
[ ]

Synchronizing IP addresses The last step is to synchronize IP addresses. We do not attach them to a device.² Instead, we specify the device names in the description of the IP address:

class SyncIPs(Synchronizer):
    app = "ipam"
    table = "ip-addresses"
    key = "address"
    foreign =  "tenant": SyncTenants 
    remove_unused = 1000
    def wanted(self):
        wanted =  
        for details in self.source['ips']:
            if details['ip'] in wanted:
                wanted[details['ip']]['description'] = \
                    f" details['device']  (and others)"
            else:
                wanted[details['ip']] = dict(
                    tenant="Network",
                    status="active",
                    dns_name="",        # information is present in DNS
                    description=f" details['device'] :  details['interface'] ",
                    role=None,
                    vrf=None)
        return wanted

There is a slight difficulty: NetBox allows duplicate IP addresses, so a simple lookup is not enough. In case of multiple matches, we choose the best by preferring those tagged with cmdb, then those already attached to an interface:

def get(self, key):
    """Grab IP address from NetBox."""
    # There may be duplicate. We need to grab the "best".
    results = super(Synchronizer, self).get(key)
    if len(results) == 0:
        return None
    if len(results) == 1:
        return results[0]
    scores = [0]*len(results)
    for idx, result in enumerate(results):
        if "cmdb" in result.tags:
            scores[idx] += 10
        if result.interface is not None:
            scores[idx] += 5
    return sorted(zip(scores, results),
                  reverse=True, key=lambda k: k[0])[0][1]

Getting the current and wanted states Each synchronizer is initialized with a reference to the Ansible module, a reference to a pynetbox s API object, the data contained in the provided YAML file and two empty dictionaries for the current and expected states:

source = yaml.safe_load(open(module.params['source']))
netbox = pynetbox.api(module.params['api'],
                      token=module.params['token'])
sync_args = dict(
    module=module,
    netbox=netbox,
    source=source,
    before= ,
    after= 
)
synchronizers = [synchronizer(**sync_args) for synchronizer in [
    SyncTags,
    SyncTenants,
    SyncSites,
    SyncManufacturers,
    SyncDeviceTypes,
    SyncDeviceRoles,
    SyncDevices,
    SyncIPs
]]

Each synchronizer has a prepare() method whose goal is to compute the current and wanted states. It returns True in case of a difference:

# Check what needs to be synchronized
try:
    for synchronizer in synchronizers:
        result['changed']  = synchronizer.prepare()
except AnsibleError as e:
    result['msg'] = e.message
    module.fail_json(**result)

Applying changes Back to the skeleton described in the previous article, the last step is to apply the changes if there is a difference between these states. Each synchronizer registers the current and wanted states in sync_args["before"][table] and sync_args["after"][table] where table is the name of the table for a given NetBox object type. The diff object is a bit elaborate as it is built table by table. This enables Ansible to display the name of each table before the diff representation:

# Compute the diff
if module._diff and result['changed']:
    result['diff'] = [
        dict(
            before_header=table,
            after_header=table,
            before=yaml.safe_dump(sync_args["before"][table]),
            after=yaml.safe_dump(sync_args["after"][table]))
        for table in sync_args["after"]
        if sync_args["before"][table] != sync_args["after"][table]
    ]
# Stop here if check mode is enabled or if no change
if module.check_mode or not result['changed']:
    module.exit_json(**result)

Each synchronizer also exposes a synchronize() method to apply changes and a cleanup() method to delete unwanted objects. Order is important due to the relation between the objects.

# Synchronize
for synchronizer in synchronizers:
    synchronizer.synchronize()
for synchronizer in synchronizers[::-1]:
    synchronizer.cleanup()
module.exit_json(**result)

The complete code is available on GitHub. Compared to using netbox.netbox collection, the logic is written in Python instead of trying to glue Ansible tasks together. I believe this is both more flexible and easier to read, notably when trying to delete outdated objects. While I did not test it, it should also be faster. An alternative would have been to reuse code from the netbox.netbox collection, as it contains similar primitives. Unfortunately, I didn t think of it until now.

In my opinion, a good option for a source of truth is to use YAML files in a Git repository. You get versioning for free and people can get started with a text editor.
This limitation is mostly due to laziness: we do not really care about this information. Our main motivation for putting IP addresses in NetBox is to keep track of the used IP addresses. However, if an IP address is already attached to an interface, we leave this association untouched.

2 September 2020

Vincent Bernat: Syncing MySQL tables with a custom Ansible module

The community.mysql collection from Ansible Galaxy provides a mysql_query module to run arbitrary MySQL queries. Unfortunately, it does not support check mode nor the --diff flag. It is also unable to tell if there was a change. Let s write a specific Ansible module to workaround these issues.

Notice I recommend that you read Writing a custom Ansible module as an introduction.

Code The module has the following signature and it executes the provided SQL statements in a single transaction. It needs a list of the affected tables to be able to detect and show the changes.

mysql_sync:
  sql:  
    DELETE FROM rules WHERE name LIKE 'CMDB:%';
    INSERT INTO rules (name, rule) VALUES
      ('CMDB: check for cats', ':is(object, "CAT")'),
      ('CMDB: check for dogs', ':is(object, "DOG")');
    REPLACE INTO webhooks (name, url) VALUES
      ('OpsGenie', 'https://opsgenie/something/token'),
      ('Slack', 'https://slack/something/token');
  user: monitoring
  password: Yooghah5
  database: monitoring
  tables:
    - rules
    - webhooks

Prerequisites The module does not enforce idempotency, but it is expected you provide appropriate SQL queries. In the above example, idempotency is achieved because the content of the `rules` table is deleted and recreated from scratch while the rows in the `webhooks` table are replaced if they already exist. You need the PyMySQL package.

Module definition Starting from the skeleton described in the previous article, here is the module definition:

module_args = dict(
    sql=dict(type='str', required=True),
    user=dict(type='str', required=True),
    password=dict(type='str', required=True, no_log=True),
    database=dict(type='str', required=True),
    tables=dict(type='list', required=True, elements='str'),
)
result = dict(
    changed=False
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)

The password is marked with no_log to ensure it won t be displayed or stored, notably when ansible-playbook runs in verbose mode. There is no host option as the module is executed on the MySQL host. Strong authentication using certificates is not implemented either. This matches our goal with custom modules: only implement what you strictly need.

Getting the current rows The next step is to retrieve the records currently in the database. The got dictionary is a mapping from table names to the list of rows they contain:

got =  
tables = module.params['tables']
connection = pymysql.connect(
    user=module.params['user'],
    password=module.params['password'],
    db=module.params['database'],
    charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor
)
with connection.cursor() as cursor:
    for table in tables:
        cursor.execute("SELECT * FROM  ".format(table))
        got[table] = cursor.fetchall()

Computing the changes Let s now build the wanted dictionary. The trick is to execute the SQL statements in a transaction without issuing a final commit. The changes will be invisible¹ to other readers and we can compare the final rows with the rows collected in got:

wanted =  
sql = module.params['sql']
statements = [statement.strip()
              for statement in sql.split(";\n")
              if statement.strip()]
with connection.cursor() as cursor:
    for statement in statements:
        try:
            cursor.execute(statement)
        except pymysql.OperationalError as err:
            code, message = err.args
            result['msg'] = "MySQL error for  :  ".format(
                statement,
                message)
            module.fail_json(**result)
    for table in tables:
        cursor.execute("SELECT * FROM  ".format(table))
        wanted[table] = cursor.fetchall()

The first for loop executes each statement. On error, we return a helpful message containing the faulty one. The second for loop records the final rows of each table in wanted.

Applying changes Back to the skeleton described in the previous article, the last step is to apply the changes if there is a difference between got and wanted when not running with check mode. The diff object is a bit more elaborate as it is built table by table. This enables Ansible to display the name of each table before the diff representation:

if got != wanted:
    result['changed'] = True
    result['diff'] = [dict(
        before_header=table,
        after_header=table,
        before=yaml.safe_dump(got[table]),
        after=yaml.safe_dump(wanted[table]))
                      for table in tables
                      if got[table] != wanted[table]]
if module.check_mode or not result['changed']:
    module.exit_json(**result)

Applying the changes is quite trivial: just commit them! Otherwise, they are lost when the module exits.

connection.commit()

The complete code is available on GitHub. Compared to the mysql_query module, this one supports the check mode, signals correctly if there is a change and displays the differences. However, it should not be used with huge tables, as it would try to load them in memory.

The tables need to use the InnoDB storage engine. Moreover, MySQL does not know how to use transactions with DDL statements: do not modify table definitions!

Vincent Bernat: Syncing SSH keys on Cisco IOS-XR with a custom Ansible module

The cisco.iosxr collection from Ansible Galaxy provides an iosxr_user module to manage local users, along with their SSH keys. However, the module is quite slow, do not display a diff for changed SSH keys, never signal change when a key is modified, and does not delete obsolete keys. Let s write a custom Ansible module managing only the SSH keys while fixing these issues.

Notice I recommend that you read Writing a custom Ansible module as an introduction.

How to add an SSH key to a user Adding SSH keys to users in Cisco IOS-XR is quite undocumented. First, you need to encode the key with the ssh-rsa key ASN.1 format, like an OpenSSH public key, but without the base64-encoding:

$ awk ' print $2 ' id_rsa.pub \
      base64 -d \
    > publickey_vincent.raw

Then, you upload the key with SCP to harddisk:/publickey_vincent.raw and import it for the current user with the following IOS command:

crypto key import authentication rsa harddisk:/publickey_vincent.b64

However, if you want to import a key for another user, you need to be part of the root-system group:

username vincent
 group root-lr
 group root-system

With the following admin command, you can attach a key to another user:

admin crypto key import authentication rsa username cedric harddisk:/publickey_cedric.b64

Code The module has the following signature and it installs the specified key for each user and remove keys from retired users the ones we do not specify.

iosxr_users:
  keys:
    vincent: ssh-rsa AAAAB3NzaC1yc2EAA[ ]ymh+YrVWLZMJR
    cedric:  ssh-rsa AAAAB3NzaC1yc2EAA[ ]RShPA8w/8eC0n

Prerequisites Unlike the `iosxr_user` module, our custom module only handles SSH keys, one per user. Therefore, the user definitions have to already exist in the running configuration.¹ Moreover, the user defined in `ansible_user` needs to be in the `root-system` group. The `cisco.iosxr` collection must also be installed as the module relies on its code. When running the module, `ansible_connection` needs to be set to `network_cli` and `ansible_network_os` to `iosxr`. These variables are usually defined in the inventory.

Module definition Starting from the skeleton described in the previous article, we define the module:

module_args = dict(
    keys=dict(type='dict', elements='str', required=True),
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)
result = dict(
    changed=False
)

Getting the installed keys The next step is to retrieve the keys currently installed. This can be done with the following command:

# show crypto key authentication rsa all
Key label: vincent
Type     : RSA public key authentication
Size     : 2048
Imported : 16:17:08 UTC Tue Aug 11 2020
Data     :
 30820122 300D0609 2A864886 F70D0101 01050003 82010F00 3082010A 02820101
 00D81E5B A73D82F3 77B1E4B5 949FB245 60FB9167 7CD03AB7 ADDE7AFE A0B83174
 A33EC0E6 1C887E02 2338367A 8A1DB0CE 0C3FBC51 15723AEB 07F301A4 B1A9961A
 2D00DBBD 2ABFC831 B0B25932 05B3BC30 B9514EA1 3DC22CBD DDCA6F02 026DBBB6
 EE3CFADA AFA86F52 CAE7620D 17C3582B 4422D24F D68698A5 52ED1E9E 8E41F062
 7DE81015 F33AD486 C14D0BB1 68C65259 F9FD8A37 8DE52ED0 7B36E005 8C58516B
 7EA6C29A EEE0833B 42714618 50B3FFAC 15DBE3EF 8DA5D337 68DAECB9 904DE520
 2D627CEA 67E6434F E974CF6D 952AB2AB F074FBA3 3FB9B9CC A0CD0ADC 6E0CDB2A
 6A1CFEBA E97AF5A9 1FE41F6C 92E1F522 673E1A5F 69C68E11 4A13C0F3 0FFC782D
 27020301 0001
[ ]

ansible_collections.cisco.iosxr.plugins.module_utils.network.iosxr.iosxr contains a run_commands() function we can use:

command = "show crypto key authentication rsa all"
out = run_commands(module, command)
out = out[0].replace(' \n', '\n')

A common library to parse a command output is textfsm: a Python module using a template-based state machine for parsing semi-formatted text.

template = r"""
Value Required Label (\w+)
Value Required,List Data ([A-F0-9 ]+)
Start
 ^Key label: $ Label 
 ^Data\s+: -> GetData
GetData
 ^ $ Data 
 ^$$ -> Record Start
""".lstrip()
re_table = textfsm.TextFSM(io.StringIO(template))
got =  data[0]: "".join(data[1]).replace(' ', '')
       for data in re_table.ParseText(out)

got is a dictionary associating key labels, considered as usernames, with a hexadecimal representation of the public key currently installed. It looks like this:

>>> pprint(got)
 'alfred': '30820122300D0609[ ]6F0203010001',
 'cedric': '30820122300D0609[ ]710203010001',
 'vincent': '30820122300D0609[ ]270203010001'

Comparing with the wanted keys Let s now build the wanted dictionary using the same structure. In module.params['keys'], we have a dictionary associating usernames to public SSH keys in the OpenSSH format:

>>> pprint(module.params['keys'])
 'cedric': 'ssh-rsa AAAAB3NzaC1yc2[ ]',
 'vincent': 'ssh-rsa AAAAB3NzaC1yc2[ ]'

We need to convert these keys in the same hexadecimal representation used by Cisco above. The ssh-keygen command and some glue can do the conversion:²

$ ssh-keygen -f id_rsa.pub -e -mPKCS8 \
     grep -v '^---' \
     base64 -d \
     hexdump -e '4/1 "%0.2X"'
30820122300D06092[ ]782D270203010001

Assuming we have a ssh2cisco() function doing that, we can build the wanted dictionary:

wanted =  k: ssh2cisco(v)
          for k, v in module.params['keys'].items()

Applying changes Back to the skeleton described in the previous article, the last step is to apply the changes if there is a difference between got and wanted when not running with check mode. The part comparing got and wanted is taken verbatim from the skeleton module:

if got != wanted:
    result['changed'] = True
    result['diff'] = dict(
        before=yaml.safe_dump(got),
        after=yaml.safe_dump(wanted)
    )
if module.check_mode or not result['changed']:
    module.exit_json(**result)

Let s copy the new or changed keys and attach them to their respective users. For this purpose, we reuse the get_connection() and copy_file() functions from ansible_collections.cisco.iosxr.plugins.module_utils.network.iosxr.iosxr.

conn = get_connection(module)
for user in wanted:
    if user not in got or wanted[user] != got[user]:
        dst = f"/harddisk:/publickey_ user .raw"
        with tempfile.NamedTemporaryFile() as src:
            decoded = base64.b64decode(
                module.params['keys'][user].split()[1])
            src.write(decoded)
            src.flush()
            copy_file(module, src.name, dst)
    command = ("admin crypto key import authentication rsa "
               f"username  user   dst ")
    conn.send_command(command, prompt="yes/no", answer="yes")

Then, we remove obsolete keys:

for user in got:
    if user not in wanted:
        command = ("admin crypto key zeroize authentication rsa "
                   f"username  user ")
        conn.send_command(command, prompt="yes/no", answer="yes")

The complete code is available on GitHub. Compared to the iosxr_user module, this one displays a diff when running with --diff, correctly signals a change, is faster, ³ and deletes unwanted SSH keys. However, it is unable to create users and cannot configure passwords or multiple SSH keys.

In our environment, the Ansible playbook pushes a full configuration, including the user definitions. Then, it synchronizes the SSH keys.
Despite the argument provided to ssh-keygen, the format used by Cisco is not PKCS#8. This is the ASN.1 representation of a Subject Public Key Info structure, as defined in RFC 2459. Moreover, PKCS#8 is a format for a private key, not a public one.
The main factors for being faster are:
- not creating users, and
- not reuploading existing SSH keys.

Vincent Bernat: Writing a custom Ansible module

Ansible ships a lot of modules you can combine for your configuration management needs. However, the quality of these modules may vary widely. Sometimes, it may be quicker and more robust to write your own module instead of shopping and assembling existing ones.¹ In my opinion, a robust module exhibits the following characteristics:

idempotency,
diff support,
check mode compatibility,
correct change signaling, and
lifecycle management.

In a nutshell, it means the module can run with --diff --check and shows the changes it would apply. When run twice in a row, the second run won t apply or signal changes. The last bullet point suggests the module should be able to delete outdated objects configured during previous runs.² The module code should be minimal and tailored to your needs. Making the module generic for use by other users is a non-goal. Less code usually means less bugs and easier to understand. I do not cover testing here. It is undeniably a good practice, but it requires a significant effort. In my opinion, it is preferable to have a well written module matching the above characteristics rather than a module that is well tested but without them or a module requiring further (untested) assembly to meet your needs.

Module skeleton Ansible documentation contains instructions to build a module, along with some best practices. As one of our non-goal is to distribute it, we choose to take some shortcuts and skip some of the boilerplate. Let s assume we build a module with the following signature:

custom_module:
  user: someone
  password: something
  data: "some random string"

There are various locations you can put a module in Ansible. A common possibility is to include it into a role. In a library/ subdirectory, create an empty __init__.py file and a custom_module.py file with the following code:³

#!/usr/bin/python
import yaml
from ansible.module_utils.basic import AnsibleModule
def main():
    # Define options accepted by the module.  
    module_args = dict(
        user=dict(type='str', required=True),
        password=dict(type='str', required=True, no_log=True),
        data=dict(type='str', required=True),
    )
    module = AnsibleModule(
        argument_spec=module_args,
        supports_check_mode=True
    )
    result = dict(
        changed=False
    )
    got =  
    wanted =  
    # Populate both  got  and  wanted .  
    # [...]
    if got != wanted:
        result['changed'] = True
        result['diff'] = dict(
            before=yaml.safe_dump(got),
            after=yaml.safe_dump(wanted)
        )
    if module.check_mode or not result['changed']:
        module.exit_json(**result)
    # Apply changes.  
    # [...]
    module.exit_json(**result)
if __name__ == '__main__':
    main()

The first part, in , defines the module, with the accepted options. Refer to the documentation on argument_spec for more details. The second part, in , builds the got and wanted variables. got is the current state while wanted is the target state. For example, if you need to modify records in a database server, got would be the current rows while wanted would be the modified rows. Then, we compare got and wanted. If there is a difference, changed is switched to True and we prepare the diff object. Ansible uses it to display the differences between the states. If we are running in check mode or if no change is detected, we stop here. The last part, in , applies the changes. Usually, it means iterating over the two structures to detect the differences and create the missing items, delete the unwanted ones and update the existing ones.

Documentation Ansible provides a fairly complete page on how to document a module. I advise you to take a more minimal approach by only documenting each option sparingly,⁴ skipping the examples and only documenting return values if it needs to. I usually limit myself to something like this:

DOCUMENTATION = """
---
module: custom_module.py
short_description: Pass provided data to remote service
description:
  - Mention anything useful for your workmate.
  - Also mention anything you want to remember in 6 months.
options:
  user:
    description:
      - user to identify to remote service
  password:
    description:
      - password for authentication to remote service
  data:
    description:
      - data to send to remote service
"""

Error handling If you run into an error, you can stop the execution with module.fail_json():

module.fail_json(
    msg=f"remote service answered with  code :  message ",
    **result
)

There is no requirement to intercept all errors. Sometimes, not swallowing an exception provides better information than replacing it with a generic message.

Returning additional values A module may return additional information that can be captured to be used in another task through the `register` directive. For this purpose, you can add arbitrary fields to the `result` dictionary. Have a look at the documentation for common return values. You should try to add these fields before exiting the module when in check mode. The returned values can be documented.

Examples Here are several examples of custom modules following the previous skeleton. Each example highlight why a custom module was written instead of assembling existing modules.

Syncing SSH keys on Cisco IOS-XR

Syncing MySQL tables

Syncing NetBox

Syncing RIPE, ARIN and APNIC objects (not released yet)

Syncing GCP IPsec VPNs (not done yet)

Also, when using modules from Ansible Galaxy, you introduce a dependency to a third-party. This is not something that should be decided lightly: it may break later, it may only meet 80% of the needs, it may add bugs.

Some declarative systems, like Terraform, exhibits all these behaviors.

Do not worry about the shebang. It is hardcoded to `/usr/bin/python`. Ansible will modify it to match the chosen interpreter on the remote host. You can write Python 3 code if `ansible_python_interpreter` evaluates to a Python 3 interpreter.

The main issue I have with this non-programmatic approach to documentation is that it partly repeats the information contained in `argument_spec`. I think an auto-documenting structure would avoid this.

23 August 2020

Vincent Bernat: Zero-Touch Provisioning for Cisco IOS

The official documentation to automatically upgrade and configure on first boot a Cisco switch running on IOS, like a Cisco Catalyst 2960-X Series switch, is scarce on details. This note explains how to configure the ISC DHCP Server for this purpose.

When booting for the first time, Cisco IOS sends a DHCP request on all ports:

Dynamic Host Configuration Protocol (Discover)
    Message type: Boot Request (1)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x0000117c
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 0.0.0.0
    Next server IP address: 0.0.0.0
    Relay agent IP address: 0.0.0.0
    Client MAC address: Cisco_6c:12:c0 (b4:14:89:6c:12:c0)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (53) DHCP Message Type (Discover)
    Option: (57) Maximum DHCP Message Size
    Option: (61) Client identifier
        Length: 25
        Type: 0
        Client Identifier: cisco-b414.896c.12c0-Vl1
    Option: (55) Parameter Request List
        Length: 12
        Parameter Request List Item: (1) Subnet Mask
        Parameter Request List Item: (66) TFTP Server Name
        Parameter Request List Item: (6) Domain Name Server
        Parameter Request List Item: (15) Domain Name
        Parameter Request List Item: (44) NetBIOS over TCP/IP Name Server
        Parameter Request List Item: (3) Router
        Parameter Request List Item: (67) Bootfile name
        Parameter Request List Item: (12) Host Name
        Parameter Request List Item: (33) Static Route
        Parameter Request List Item: (150) TFTP Server Address
        Parameter Request List Item: (43) Vendor-Specific Information
        Parameter Request List Item: (125) V-I Vendor-specific Information
    Option: (255) End

It requests a number of options, including the Bootfile name option 67, the TFTP server address option 150 and the Vendor-Identifying Vendor-Specific Information Option 125 or VIVSO. Option 67 provides the name of the configuration file located on the TFTP server identified by option 150. Option 125 includes the name of the file describing the Cisco IOS image to use to upgrade the switch. This file only contains the name of the tarball embedding the image.¹ Configuring the ISC DHCP Server to answer with the TFTP server address and the name of the configuration file is simple enough:

filename "ob2-p2.example.com";
option tftp-server-address 172.16.15.253;

However, if you want to also provide the image for upgrade, you have to specify a hexadecimal-encoded string:²

option vivso 00:00:00:09:24:05:22:63:32:39:36:30:2d:6c:61:6e:62:61:73:65:6b:39:2d:74:61:72:2e:31:35:30:2d:32:2e:53:45:31:31:2e:74:78:74;

Having a large hexadecimal-encoded string inside a configuration file is quite unsatisfying. Instead, the ISC DHCP Server allows you to express this information in a more readable way using the

option
space

statement:

# Create option space for Cisco and encapsulate it in VIVSO/vendor space
option space cisco code width 1 length width 1;
option cisco.auto-update-image code 5 = text;
option vendor.cisco code 9 = encapsulate cisco;
# Image description for Cisco IOS ZTP
option cisco.auto-update-image = "c2960-lanbasek9-tar.150-2.SE11.txt";
# Workaround for VIVSO option 125 not being sent
option vendor.iana code 0 = string;
option vendor.iana = 01:01:01;

Without the workaround mentioned in the last block, the ISC DHCP Server would not send back option 125. With such a configuration, it returns the following answer, including a harmless additional enterprise 0 encapsulated into option 125:

Dynamic Host Configuration Protocol (Offer)
    Message type: Boot Reply (2)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x0000117c
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 172.16.15.6
    Next server IP address: 0.0.0.0
    Relay agent IP address: 0.0.0.0
    Client MAC address: Cisco_6c:12:c0 (b4:14:89:6c:12:c0)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name: ob2-p2.example.com
    Magic cookie: DHCP
    Option: (53) DHCP Message Type (Offer)
    Option: (54) DHCP Server Identifier (172.16.15.252)
    Option: (51) IP Address Lease Time
    Option: (1) Subnet Mask (255.255.248.0)
    Option: (6) Domain Name Server
    Option: (3) Router
    Option: (150) TFTP Server Address
        Length: 4
        TFTP Server Address: 172.16.15.252
    Option: (125) V-I Vendor-specific Information
        Length: 49
        Enterprise: Reserved (0)
        Enterprise: ciscoSystems (9)
            Length: 36
            Option 125 Suboption: 5
                Length: 34
                Data: 63323936302d6c616e626173656b392d7461722e3135302d 
    Option: (255) End

The reason of this indirection is still puzzling me. I suppose it could be because updating the image name directly in option 125 is quite a hassle.
It contains the following information:
- 0x00000009: Cisco s Enterprise Number,
- 0x24: length of the enclosed data,
- 0x05: Cisco s auto-update sub-option,
- 0x22: length of the sub-option data, and
- filename of the image description (c2960-lanbasek9-tar.150-2.SE11.txt).

18 July 2020

Ritesh Raj Sarraf: Laptop Mode Tools 1.74

Laptop Mode Tools 1.74 Laptop Mode Tools version `1.74` has been released. This release includes `important bug fixes`, some `defaults settings` updated to current driver support in Linux and support for devices with `nouveau` based nVIDIA cards. A filtered list of changes is mentioned below. For the full log, please refer to the git repository

1.74 - Sat Jul 18 19:10:40 IST 2020

* With 4.15+ kernels, Linux Intel SATA has a better link power
  saving policy, med_power_with_dipm, which should be the recommended
  one to use
* Disable defaults for syslog logging
* Initialize LM_VERBOSE with default to disabled
* Merge pull request #157 from rickysarraf/nouveau
* Add power saving module for nouveau cards
* Disable ethernet module by default
* Add board-specific folder and documentation
* Add execute bit on module radeon-dpm
* Drop unlock because there is no lock acquired

Resources

Source tarball, Feodra/SUSE RPM Packages available at project s release page

Debian packages will be available soon in Unstable.

Homepage: https://github.com/rickysarraf/laptop-mode-tools/wiki

Mailing List: https://groups.google.com/d/forum/laptop-mode-tools

What is Laptop Mode Tools

Description: Tools for Power Savings based on battery/AC status
 Laptop mode is a Linux kernel feature that allows your laptop to save
 considerable power, by allowing the hard drive to spin down for longer
 periods of time. This package contains the userland scripts that are
 needed to enable laptop mode.
 .
 It includes support for automatically enabling laptop mode when the
 computer is working on batteries. It also supports various other power
 management features, such as starting and stopping daemons depending on
 power mode, automatically hibernating if battery levels are too low, and
 adjusting terminal blanking and X11 screen blanking
 .
 laptop-mode-tools uses the Linux kernel's Laptop Mode feature and thus
 is also used on Desktops and Servers to conserve power

5 April 2020

Vincent Bernat: Safer SSH agent forwarding

ssh-agent is a program to hold in memory the private keys used by SSH for public-key authentication. When the agent is running, ssh forwards to it the signature requests from the server. The agent performs the private key operations and returns the results to ssh. It is useful if you keep your private keys encrypted on disk and you don t want to type the password at each connection. Keeping the agent secure is critical: someone able to communicate with the agent can authenticate on your behalf on remote servers. ssh also provides the ability to forward the agent to a remote server. From this remote server, you can authenticate to another server using your local agent, without copying your private key on the intermediate server. As stated in the manual page, this is dangerous!

Agent forwarding should be enabled with caution. Users with the ability to bypass file permissions on the remote host (for the agent s UNIX-domain socket) can access the local agent through the forwarded connection. An attacker cannot obtain key material from the agent, however they can perform operations on the keys that enable them to authenticate using the identities loaded into the agent. A safer alternative may be to use a jump host (see -J).

As mentioned, a better alternative is to use the jump host feature: the SSH connection to the target host is tunneled through the SSH connection to the jump host. See the manual page and this blog post for more details.

If you really need to use SSH agent forwarding, you can secure it a bit through a dedicated agent with two main attributes:

it holds only the private key to connect to the target host, and
it asks confirmation for each requested signature.

The following alias around the ssh command will spawn such an ephemeral agent:

alias assh="ssh-agent ssh -o AddKeysToAgent=confirm -o ForwardAgent=yes"

With the -o AddKeysToAgent=confirm directive, ssh adds the unencrypted private key to the agent but each use must be confirmed.¹ Once connected, you get a password prompt for each signature request:²

ssh-agent prompt confirmation with fingerprint and yes/no buttons

But, again, avoid using agent forwarding!

Update (2020-04) In a previous version of this article, the wrapper around the ssh command was a more complex function. Alexandre Oliva was kind enough to point me to the simpler solution above.

Update (2020-04) Guardian Agent is an even safer alternative: it shows and ensures the usage (target and command) of the requested signature. There is also a wide range of alternative solutions to this problem. See for example SSH-Ident, Wikimedia solution and solo-agent.

Alternatively, you can add the keys with ssh-add -c.
Unfortunately, the dialog box default answer is Yes.

9 October 2017

Markus Koschany: My Free Software Activities in September 2017

Welcome to gambaru.de. Here is my monthly report that covers what I have been doing for Debian. If you re interested in Java, Games and LTS topics, this might be interesting for you. Debian Games

I sponsored a new release of hexalate for Unit193 and icebreaker for Andreas Gnau. The latter is a reintroduction.
New upstream releases this month: freeorion and hyperrogue.
I backported freeciv and freeorion to Stretch.

Debian Java

New upstream releases and one update: sweethome3d, sweethome3d-furniture, sweethome3d-furniture-editor, sweethome3d-textures-editor (update), libsambox-java, libsejda-java, pdfsam, easymock, jboss-modules, jboss-xnio and undertow.
I fixed one RC bug in libsejda-io-java (#874494) and investigated another one (#869266) in commons-httpclient which could be closed.
The new build-dependencies of jboss-xnio, wildfly-client-config and wildfly-common, were accepted into the archive this month.
I spent some quality time on fixing #874579 in libhibernate-validator-java. This was the last blocking bug for pdfsam which I could finally upload to unstable. It s a really great JavaFX application. Check it out!
I sponsored another update of libimglib2-java for Ghislain Vaillant and simplyhtml, freeplane and knopflerfish-osgi for Felix Natter.
I also fixed RC bug #871348 in robocode, a Java programming game and #871347 in tycho.

Debian LTS This was my nineteenth month as a paid contributor and I have been paid to work 15,75 hours on Debian LTS, a project started by Rapha l Hertzog. In that time I did the following:

From 18. September to 24. September I was in charge of our LTS frontdesk. I triaged bugs in poppler, binutils, kannel, wordpress, libsndfile, libexif, nautilus, libstruts1.2-java, nvidia-graphics-drivers, p3scan, otrs2 and glassfish.
DLA-1108-1. Issued a security update for tomcat7 fixing 1 CVE.
DLA-1116-1. Issued a security update for poppler fixing 3 CVE.
DLA-1119-1. Issued a security update for otrs2 fixing 4 CVE.
DLA-1122-1. Issued a security update for asterisk fixing 1 CVE. I also investigated CVE-2017-14099 and CVE-2017-14603. I decided against a backport because the fix was too intrusive and the vulnerable option is disabled by default in Wheezy s version which makes it a minor issue for most users.
I submitted a patch for Debian s reportbug tool. (#878088) During our LTS BoF at DebConf 17 we came to the conclusion that we should implement a feature in reportbug that checks whether the bug reporter wants to report a regression for a recent security update. Usually the LTS and security teams receive word from the maintainer or users who report issues directly to our mailing lists or IRC channels. However in some cases we were not informed about possible regressions and the new feature in reportbug shall ensure that we can respond faster to such reports.
I started to investigate the open security issues in wordpress and will complete the work in October.

Misc

I packaged a new version of xarchiver. Thanks to the work of Ingo Br ckl xarchiver can handle almost all archive formats in Debian now.

QA upload

I did a QA upload of xball, an ancient game from the 90ies that simulates bouncing balls. It should be ready for another decade at least.

Thanks for reading and see you next time.

22 September 2017

Enrico Zini: Systemd on the command line

These are the notes of a training course on systemd I gave as part of my work with Truelite. Exploring the state of a system

systemctl status [unitname [unitname..]] show status of one or more units, or of the whole system. Glob patterns also work: systemctl status "systemd-fsck@*"
systemctl list-units or just systemctl show a table with all units, their status and their description
systemctl list-sockets lists listening sockets managed by systemd and what they activate
systemctl list-timers lists timer-activated units, with information about when they last ran and when they will run again
systemctl is-active [pattern] checks if one or more units are in active state
systemctl is-enabled [pattern] checks if one or more units are enabled
systemctl is-failed [pattern] checks if one or more units are in failed state
systemctl list-dependencies [unitname] lists the dependencies of a unit, or a system-wide dependency tree
systemctl is-system-running check if the system is running correctly, or if some unit is in a failed state
systemd-cgtop like top but processes are aggregated by unit
systemd-analyze produces reports on boot time, per-unit boot time charts, dependency graphs, and more

Start and stop services Similar to the System V service command, systemctl provides commands to start/stop/restart/reload units or services:

start: starts a unit if it is not already started
stop: stops a unit
restart: starts or restarts a unit
reload: tell a unit to reload its configuration (if it supports it)
try-restart: restarts a unit only if it is already active, otherwise do nothing, to prevent accidentally starting a service
reload-or-restart: tell a unit to reload its configuration if supported, otherwise restart it
try-reload-or-restart: tell a unit to reload its configuration if supported, otherwise restart it. If the unit is not already active, do nothing to prevent accidentally starting a service.

Changing global system state systemctl has halt, poweroff, reboot, suspend, hibernate, and hybrid-sleep commands to tell systemd to reboot, power off, suspend and so on. kexec and switch-root also work. The rescue and emergency commands switch the system to rescue and emergency mode (see man systemd.special. systemctl default switches to the default mode, which also happens when exiting the rescue or emergency shell. Run services at boot systemd does not implement runlevels, and services start at boot based on their dependencies. To start a service at boot, you add to its .service file a WantedBy= dependency on a well-known .target unit. At boot, systemd brings up the whole chain of dependency started from a default unit, and that will eventually activate also your service. See systemctl get-default for what unit is currently the default in your system. You can change it via the systemd.unit= kernel command line, so you can configure multiple entries in the boot loader that boot the system running different services. For example systemd.unit=rescue.target for a rescue mode, systemd.unit=multi-user.target for a non-graphical mode, or add your own .target file to implement new system modes. See systemctl list-units -t target --all for a list of all currently available targets in your system.

systemctl enable unitname enables the unit to start at boot, by creating symlinks to it in the .wants directory of the units listed in its WantedBy= configuration
systemctl disable unitname removes the symlinks created by enable
systemctl reenable unitname removes and readds the symlinks for when you changed WantedBy=

Notes: systemctl start activates a unit right now, but does not automatically enable it at boot systemctl enable enables a unit at boot, but does not automatically start it right now * a disabled unit can still be activated if another unit depends on it To disable a unit so that it will never get started even if another unit depends on it, use systemctl mask unitname. Use systemctl unmask unitname to undo the masking. Reloading / restarting systemd systemctl daemon-reload tells systemd to reload its configuration. systemctl daemon-reexec tells systemd to restart iself.

13 September 2017

Vincent Bernat: Route-based IPsec VPN on Linux with strongSwan

A common way to establish an IPsec tunnel on Linux is to use an IKE daemon, like the one from the strongSwan project, with a minimal configuration¹:

conn V2-1
  left        = 2001:db8:1::1
  leftsubnet  = 2001:db8:a1::/64
  right       = 2001:db8:2::1
  rightsubnet = 2001:db8:a2::/64
  authby      = psk
  auto        = route

The same configuration can be used on both sides. Each side will figure out if it is left or right . The IPsec site-to-site tunnel endpoints are 2001:db8: 1::1 and 2001:db8: 2::1. The protected subnets are 2001:db8: a1::/64 and 2001:db8: a2::/64. As a result, strongSwan configures the following policies in the kernel:

$ ip xfrm policy
src 2001:db8:a1::/64 dst 2001:db8:a2::/64
        dir out priority 399999 ptype main
        tmpl src 2001:db8:1::1 dst 2001:db8:2::1
                proto esp reqid 4 mode tunnel
src 2001:db8:a2::/64 dst 2001:db8:a1::/64
        dir fwd priority 399999 ptype main
        tmpl src 2001:db8:2::1 dst 2001:db8:1::1
                proto esp reqid 4 mode tunnel
src 2001:db8:a2::/64 dst 2001:db8:a1::/64
        dir in priority 399999 ptype main
        tmpl src 2001:db8:2::1 dst 2001:db8:1::1
                proto esp reqid 4 mode tunnel
[ ]

This kind of IPsec tunnel is a policy-based VPN: encapsulation and decapsulation are governed by these policies. Each of them contains the following elements:

a direction (out, in or fwd²),
a selector (source subnet, destination subnet, protocol, ports),
a mode (transport or tunnel),
an encapsulation protocol (esp or ah), and
the endpoint source and destination addresses.

When a matching policy is found, the kernel will look for a corresponding security association (using reqid and the endpoint source and destination addresses):

$ ip xfrm state
src 2001:db8:1::1 dst 2001:db8:2::1
        proto esp spi 0xc1890b6e reqid 4 mode tunnel
        replay-window 0 flag af-unspec
        auth-trunc hmac(sha256) 0x5b68[ ]8ba2904 128
        enc cbc(aes) 0x8e0e377ad8fd91e8553648340ff0fa06
        anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
[ ]

If no security association is found, the packet is put on hold and the IKE daemon is asked to negotiate an appropriate one. Otherwise, the packet is encapsulated. The receiving end identifies the appropriate security association using the SPI in the header. Two security associations are needed to establish a bidirectionnal tunnel:

$ tcpdump -pni eth0 -c2 -s0 esp
13:07:30.871150 IP6 2001:db8:1::1 > 2001:db8:2::1: ESP(spi=0xc1890b6e,seq=0x222)
13:07:30.872297 IP6 2001:db8:2::1 > 2001:db8:1::1: ESP(spi=0xcf2426b6,seq=0x204)

All IPsec implementations are compatible with policy-based VPNs. However, some configurations are difficult to implement. For example, consider the following proposition for redundant site-to-site VPNs: Redundant VPNs between 3 sites

A possible configuration between V1-1 and V2-1 could be:

conn V1-1-to-V2-1
  left        = 2001:db8:1::1
  leftsubnet  = 2001:db8:a1::/64,2001:db8:a6::cc:1/128,2001:db8:a6::cc:5/128
  right       = 2001:db8:2::1
  rightsubnet = 2001:db8:a2::/64,2001:db8:a6::/64,2001:db8:a8::/64
  authby      = psk
  keyexchange = ikev2
  auto        = route

Each time a subnet is modified on one site, the configurations need to be updated on all sites. Moreover, overlapping subnets (2001:db8: a6::/64 on one side and 2001:db8: a6::cc:1/128 at the other) can also be problematic. The alternative is to use route-based VPNs: any packet traversing a pseudo-interface will be encapsulated using a security policy bound to the interface. This brings two features:

Routing daemons can be used to distribute routes to be protected by the VPN. This decreases the administrative burden when many subnets are present on each side.
Encapsulation and decapsulation can be executed in a different routing instance or namespace. This enables a clean separation between a private routing instance (where VPN users are) and a public routing instance (where VPN endpoints are).

Route-based VPN on Juniper Before looking at how to achieve that on Linux, let s have a look at the way it works with a JunOS-based platform (like a Juniper vSRX). This platform as long-standing history of supporting route-based VPNs (a feature already present in the Netscreen ISG platform). Let s assume we want to configure the IPsec VPN from V3-2 to V1-1. First, we need to configure the tunnel interface and bind it to the private routing instance containing only internal routes (with IPv4, they would have been RFC 1918 routes):

interfaces  
    st0  
        unit 1  
            family inet6  
                address 2001:db8:ff::7/127;
             
         
     
 
routing-instances  
    private  
        instance-type virtual-router;
        interface st0.1;

The second step is to configure the VPN:

security  
    /* Phase 1 configuration */
    ike  
        proposal IKE-P1  
            authentication-method pre-shared-keys;
            dh-group group20;
            encryption-algorithm aes-256-gcm;
         
        policy IKE-V1-1  
            mode main;
            proposals IKE-P1;
            pre-shared-key ascii-text "d8bdRxaY22oH1j89Z2nATeYyrXfP9ga6xC5mi0RG1uc";
         
        gateway GW-V1-1  
            ike-policy IKE-V1-1;
            address 2001:db8:1::1;
            external-interface lo0.1;
            general-ikeid;
            version v2-only;
         
     
    /* Phase 2 configuration */
    ipsec  
        proposal ESP-P2  
            protocol esp;
            encryption-algorithm aes-256-gcm;
         
        policy IPSEC-V1-1  
            perfect-forward-secrecy keys group20;
            proposals ESP-P2;
         
        vpn VPN-V1-1  
            bind-interface st0.1;
            df-bit copy;
            ike  
                gateway GW-V1-1;
                ipsec-policy IPSEC-V1-1;
             
            establish-tunnels on-traffic;

We get a route-based VPN because we bind the st0.1 interface to the VPN-V1-1 VPN. Once the VPN is up, any packet entering st0.1 will be encapsulated and sent to the 2001:db8: 1::1 endpoint. The last step is to configure BGP in the private routing instance to exchange routes with the remote site:

routing-instances  
    private  
        routing-options  
            router-id 1.0.3.2;
            maximum-paths 16;
         
        protocols  
            bgp  
                preference 140;
                log-updown;
                group v4-VPN  
                    type external;
                    local-as 65003;
                    hold-time 6;
                    neighbor 2001:db8:ff::6 peer-as 65001;
                    multipath;
                    export [ NEXT-HOP-SELF OUR-ROUTES NOTHING ];

The export filter OUR-ROUTES needs to select the routes to be advertised to the other peers. For example:

policy-options  
    policy-statement OUR-ROUTES  
        term 10  
            from  
                protocol ospf3;
                route-type internal;
             
            then  
                metric 0;
                accept;

The configuration needs to be repeated for the other peers. The complete version is available on GitHub. Once the BGP sessions are up, we start learning routes from the other sites. For example, here is the route for 2001:db8: a1::/64:

> show route 2001:db8:a1::/64 protocol bgp table private.inet6.0 best-path
private.inet6.0: 15 destinations, 19 routes (15 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
2001:db8:a1::/64   *[BGP/140] 01:12:32, localpref 100, from 2001:db8:ff::6
                      AS path: 65001 I, validation-state: unverified
                      to 2001:db8:ff::6 via st0.1
                    > to 2001:db8:ff::14 via st0.2

It was learnt both from V1-1 (through st0.1) and V1-2 (through st0.2). The route is part of the private routing instance but encapsulated packets are sent/received in the public routing instance. No route-leaking is needed for this configuration. The VPN cannot be used as a gateway from internal hosts to external hosts (or vice-versa). This could also have been done with JunOS security policies (stateful firewall rules) but doing the separation with routing instances also ensure routes from different domains are not mixed and a simple policy misconfiguration won t lead to a disaster.

Route-based VPN on Linux Starting from Linux 3.15, a similar configuration is possible with the help of a virtual tunnel interface³. First, we create the private namespace:

# ip netns add private
# ip netns exec private sysctl -qw net.ipv6.conf.all.forwarding=1

Any private interface needs to be moved to this namespace (no IP is configured as we can use IPv6 link-local addresses):

# ip link set netns private dev eth1
# ip link set netns private dev eth2
# ip netns exec private ip link set up dev eth1
# ip netns exec private ip link set up dev eth2

Then, we create vti6, a tunnel interface (similar to st0.1 in the JunOS example):

# ip tunnel add vti6 \
   mode vti6 \
   local 2001:db8:1::1 \
   remote 2001:db8:3::2 \
   key 6
# ip link set netns private dev vti6
# ip netns exec private ip addr add 2001:db8:ff::6/127 dev vti6
# ip netns exec private sysctl -qw net.ipv4.conf.vti6.disable_policy=1
# ip netns exec private sysctl -qw net.ipv4.conf.vti6.disable_xfrm=1
# ip netns exec private ip link set vti6 mtu 1500
# ip netns exec private ip link set vti6 up

The tunnel interface is created in the initial namespace and moved to the private one. It will remember its original namespace where it will process encapsulated packets. Any packet entering the interface will temporarily get a firewall mark of 6 that will be used only to match the appropriate IPsec policy⁴ below. The kernel sets a low MTU on the interface to handle any possible combination of ciphers and protocols. We set it to 1500 and let PMTUD do its work. We can then configure strongSwan⁵:

conn V3-2
  left        = 2001:db8:1::1
  leftsubnet  = ::/0
  right       = 2001:db8:3::2
  rightsubnet = ::/0
  authby      = psk
  mark        = 6
  auto        = route
  keyexchange = ikev2
  keyingtries = %forever
  ike         = aes256gcm16-prfsha384-ecp384!
  esp         = aes256gcm16-prfsha384-ecp384!
  mobike      = no

The IKE daemon configures the following policies in the kernel:

$ ip xfrm policy
src ::/0 dst ::/0
        dir out priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:1::1 dst 2001:db8:3::2
                proto esp reqid 1 mode tunnel
src ::/0 dst ::/0
        dir fwd priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:3::2 dst 2001:db8:1::1
                proto esp reqid 1 mode tunnel
src ::/0 dst ::/0
        dir in priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:3::2 dst 2001:db8:1::1
                proto esp reqid 1 mode tunnel
[ ]

Those policies are used for any source or destination as long as the firewall mark is equal to 6, which matches the mark configured for the tunnel interface. The last step is to configure BGP to exchange routes. We can use BIRD for this:

router id 1.0.1.1;
protocol device  
   scan time 10;
 
protocol kernel  
   persist;
   learn;
   import all;
   export all;
   merge paths yes;
 
protocol bgp IBGP_V3_2  
   local 2001:db8:ff::6 as 65001;
   neighbor 2001:db8:ff::7 as 65003;
   import all;
   export where ifname ~ "eth*";
   preference 160;
   hold time 6;

Once BIRD is started in the private namespace, we can check routes are learned correctly:

$ ip netns exec private ip -6 route show 2001:db8:a3::/64
2001:db8:a3::/64 proto bird metric 1024
        nexthop via 2001:db8:ff::5  dev vti5 weight 1
        nexthop via 2001:db8:ff::7  dev vti6 weight 1

The above route was learnt from both V3-1 (through vti5) and V3-2 (through vti6). Like for the JunOS version, there is no route-leaking between the private namespace and the initial one. The VPN cannot be used as a gateway between the two namespaces, only for encapsulation. This also prevent a misconfiguration (for example, IKE daemon not running) from allowing packets to leave the private network. As a bonus, unencrypted traffic can be observed with tcpdump on the tunnel interface:

$ ip netns exec private tcpdump -pni vti6 icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vti6, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
20:51:15.258708 IP6 2001:db8:a1::1 > 2001:db8:a3::1: ICMP6, echo request, seq 69
20:51:15.260874 IP6 2001:db8:a3::1 > 2001:db8:a1::1: ICMP6, echo reply, seq 69

You can find all the configuration files for this example on GitHub. The documentation of strongSwan also features a page about route-based VPNs.

Everything in this post should work with Libreswan.
fwd is for incoming packets on non-local addresses. It only makes sense in transport mode and is a Linux-only particularity.
Virtual tunnel interfaces (VTI) were introduced in Linux 3.6 (for IPv4) and Linux 3.12 (for IPv6). Appropriate namespace support was added in 3.15. KLIPS, an alternative out-of-tree stack available since Linux 2.2, also features tunnel interfaces.
The mark is set right before doing a policy lookup and restored after that. Consequently, it doesn t affect other possible uses (filtering, routing). However, as Netfilter can also set a mark, one should be careful for conflicts.
The ciphers used here are the strongest ones currently possible while keeping compatibility with JunOS. The documentation for strongSwan contains a complete list of supported algorithms as well as security recommendations to choose them.

12 September 2017

Markus Koschany: My Free Software Activities in August 2017

Welcome to gambaru.de. Here is my monthly report that covers what I have been doing for Debian. If you re interested in Java, Games and LTS topics, this might be interesting for you. DebConf 17 in Montreal I traveled to DebConf 17 in Montreal/Canada. I arrived on 04. August and met a lot of different people which I only knew by name so far. I think this is definitely one of the best aspects of real life meetings, putting names to faces and getting to know someone better. I totally enjoyed my stay and I would like to thank all the people who were involved in organizing this event. You rock! I also gave a talk about the The past, present and future of Debian Games , listened to numerous other talks and got a nice sunburn which luckily turned into a more brownish color when I returned home on 12. August. The only negative experience I made was with my airline which was supposed to fly me home to Frankfurt again. They decided to cancel the flight one hour before check-in for unknown reasons and just gave me a telephone number to sort things out. No support whatsoever. Fortunately (probably not for him) another DebConf attendee suffered the same fate and together we could find another flight with Royal Air Maroc the same day. And so we made a short trip to Casablanca/Morocco and eventually arrived at our final destination in Frankfurt a few hours later. So which airline should you avoid at all costs (they still haven t responded to my refund claims) ? It s WoW-Air from Iceland. (just wow) Debian Games

There were a lot of GCC-7 bugs to fix this month which claimed most of my games related time.
Bug fixes: torcs (RC #853685), berusky2 (RC #853325), lordsawar (RC #853529), simutrans(#869029), gngb (RC #853425), libclaw (RC #853488), funguloids (RC #853408), plee-the-bear (RC #853618), alien-arena (RC #871218), fretsonfire (RC #872934), minetest, (RC #873324) widelands, (RC #871114), pingus (RC #853614)
I released version 2.1 of the Debian Games Blend.
I completely overhauled the gngb package, a color gameboy emulator.
I packaged new upstream versions of freeciv, peg-e and blockattack.
I backported the memory leak fix for unknown-horizons and fife to Stretch (#871037).
I investigated some graphical glitches in Neverball which appear to be related to OpenGL and the graphic stack in general but I couldn t find an immediate solution. (#871223)

Debian Java

I sponsored libimglib2-java for Ghislain Vaillant.
New upstream releases: apktool, jboss-modules, jboss-logging-tools, jboss-logmanager.
For jboss-xnio I packaged two new build-dependencies which are wildfly-common and wildfly-client-config and they are currently waiting in the NEW queue.
The last build-dependency for PDFsam was accepted this month and I was able to upload the new version to experimental. Unfortunately the program is currently not really usable due to a bug in libhibernate-validator-java (#874579)

Debian LTS This was my eighteenth month as a paid contributor and I have been paid to work 20,25 hours on Debian LTS, a project started by Rapha l Hertzog. In that time I did the following:

From 31. July until 06. August I was in charge of our LTS frontdesk. I triaged bugs in tinyproxy, mantis, sox, timidity, ioquake3, varnish, libao, clamav, binutils, smplayer, libid3tag, mpg123 and shadow.
DLA-1064-1. Issued a security update for freeradius fixing 6 CVE.
DLA-1068-1. Issued a security update for git fixing 1 CVE.
DLA-1077-1. Issued a security update for faad2 fixing 11 CVE.
DLA-1083-1. Issued a security update for openexr fixing 3 CVE.
DLA-1095-1. Issued a security update for freerdp fixing 5 CVE.

Non-maintainer upload

I uploaded a security fix for openexr (#864078) to fix CVE-2017-9110, CVE-2017-9112 and CVE-2017-9116.

Thanks for reading and see you next time.

20 August 2017

Vincent Bernat: IPv6 route lookup on Linux

TL;DR: With its implementation of IPv6 routing tables using radix trees, Linux offers subpar performance (450 ns for a full view 40,000 routes) compared to IPv4 (50 ns for a full view 500,000 routes) but fair memory usage (20 MiB for a full view).

In a previous article, we had a look at IPv4 route lookup on Linux. Let s see how different IPv6 is.

Lookup trie implementation Looking up a prefix in a routing table comes down to find the most specific entry matching the requested destination. A common structure for this task is the trie, a tree structure where each node has its parent as prefix. With IPv4, Linux uses a level-compressed trie (or LPC-trie), providing good performances with low memory usage. For IPv6, Linux uses a more classic radix tree (or Patricia trie). There are three reasons for not sharing:

The IPv6 implementation (introduced in Linux 2.1.8, 1996) predates the IPv4 implementation based on LPC-tries (in Linux 2.6.13, commit 19baf839ff4a).
The feature set is different. Notably, IPv6 supports source-specific routing¹ (since Linux 2.1.120, 1998).
The IPv4 address space is denser than the IPv6 address space. Level-compression is therefore quite efficient with IPv4. This may not be the case with IPv6.

The trie in the below illustration encodes 6 prefixes:

For more in-depth explanation on the different ways to encode a routing table into a trie and a better understanding of radix trees, see the explanations for IPv4. The following figure shows the in-memory representation of the previous radix tree. Each node corresponds to a struct fib6_node. When a node has the RTN_RTINFO flag set, it embeds a pointer to a struct rt6_info containing information about the next-hop. Memory representation of a routing table

Memory representation of a routing table

The fib6_lookup_1() function walks the radix tree in two steps:

walking down the tree to locate the potential candidate, and
checking the candidate and, if needed, backtracking until a match.

Here is a slightly simplified version without source-specific routing:

static struct fib6_node *fib6_lookup_1(struct fib6_node *root,
                                       struct in6_addr  *addr)
 
    struct fib6_node *fn;
    __be32 dir;
    /* Step 1: locate potential candidate */
    fn = root;
    for (;;)  
        struct fib6_node *next;
        dir = addr_bit_set(addr, fn->fn_bit);
        next = dir ? fn->right : fn->left;
        if (next)  
            fn = next;
            continue;
         
        break;
     
    /* Step 2: check prefix and backtrack if needed */
    while (fn)  
        if (fn->fn_flags & RTN_RTINFO)  
            struct rt6key *key;
            key = fn->leaf->rt6i_dst;
            if (ipv6_prefix_equal(&key->addr, addr, key->plen))  
                if (fn->fn_flags & RTN_RTINFO)
                    return fn;
             
         
        if (fn->fn_flags & RTN_ROOT)
            break;
        fn = fn->parent;
     
    return NULL;

Caching While IPv4 lost its route cache in Linux 3.6 (commit 5e9965c15ba8), IPv6 still has a caching mechanism. However cache entries are directly put in the radix tree instead of a distinct structure. Since Linux 2.1.30 (1997) and until Linux 4.2 (commit 45e4fd26683c), almost any successful route lookup inserts a cache entry in the radix tree. For example, a router forwarding a ping between 2001:db8:1::1 and 2001:db8:3::1 would get those two cache entries:

$ ip -6 route show cache
2001:db8:1::1 dev r2-r1  metric 0
    cache
2001:db8:3::1 via 2001:db8:2::2 dev r2-r3  metric 0
    cache

These entries are cleaned up by the ip6_dst_gc() function controlled by the following parameters:

$ sysctl -a   grep -F net.ipv6.route
net.ipv6.route.gc_elasticity = 9
net.ipv6.route.gc_interval = 30
net.ipv6.route.gc_min_interval = 0
net.ipv6.route.gc_min_interval_ms = 500
net.ipv6.route.gc_thresh = 1024
net.ipv6.route.gc_timeout = 60
net.ipv6.route.max_size = 4096
net.ipv6.route.mtu_expires = 600

The garbage collector is triggered at most every 500 ms when there are more than 1024 entries or at least every 30 seconds. The garbage collection won t run for more than 60 seconds, except if there are more than 4096 routes. When running, it will first delete entries older than 30 seconds. If the number of cache entries is still greater than 4096, it will continue to delete more recent entries (but no more recent than 512 jiffies, which is the value of gc_elasticity) after a 500 ms pause. Starting from Linux 4.2 (commit 45e4fd26683c), only a PMTU exception would create a cache entry. A router doesn t have to handle those exceptions, so only hosts would get cache entries. And they should be pretty rare. Martin KaFai Lau explains:

Out of all IPv6 RTF_CACHE routes that are created, the percentage that has a different MTU is very small. In one of our end-user facing proxy server, only 1k out of 80k RTF_CACHE routes have a smaller MTU. For our DC traffic, there is no MTU exception.

Here is how a cache entry with a PMTU exception looks like:

$ ip -6 route show cache
2001:db8:1::50 via 2001:db8:1::13 dev out6  metric 0
    cache  expires 573sec mtu 1400 pref medium

Performance We consider three distinct scenarios:

Excerpt of an Internet full view

In this scenario, Linux acts as an edge router attached to the default-free zone. Currently, the size of such a routing table is a little bit above 40,000 routes.

/48 prefixes spread linearly with different densities

Linux acts as a core router inside a datacenter. Each customer or rack gets one or several /48 networks, which need to be routed around. With a density of 1, /48 subnets are contiguous.

/128 prefixes spread randomly in a fixed /108 subnet

Linux acts as a leaf router for a /64 subnet with hosts getting their IP using autoconfiguration. It is assumed all hosts share the same OUI and therefore, the first 40 bits are fixed. In this scenario, neighbor reachability information for the /64 subnet are converted into routes by some external process and redistributed among other routers sharing the same subnet².

Route lookup performance With the help of a small kernel module, we can accurately benchmark³ the `ip6_route_output_flags()` function and correlate the results with the radix tree size: Getting meaningful results is challenging due to the size of the address space. None of the scenarios have a fallback route and we only measure time for successful hits⁴. For the full view scenario, only the range from `2400::/16` to `2a06::/16` is scanned (it contains more than half of the routes). For the /128 scenario, the whole /108 subnet is scanned. For the /48 scenario, the range from the first /48 to the last one is scanned. For each range, 5000 addresses are picked semi-randomly. This operation is repeated until we get 5000 hits or until 1 million tests have been executed. The relation between the maximum depth and the lookup time is incomplete and I can t explain the difference of performance between the different densities of the /48 scenario. We can extract two important performance points:

With a full view, the lookup time is 450 ns. This is almost ten times the budget for forwarding at 10 Gbps which is about 50 ns.

With an almost empty routing table, the lookup time is 150 ns. This is still over the time budget for forwarding at 10 Gbps.

With IPv4, the lookup time for an almost empty table was 20 ns while the lookup time for a full view (500,000 routes) was a bit above 50 ns. How to explain such a difference? First, the maximum depth of the IPv4 LPC-trie with 500,000 routes was 6, while the maximum depth of the IPv6 radix tree for 40,000 routes is 40. Second, while both IPv4 s `fib_lookup()` and IPv6 s `ip6_route_output_flags()` functions have a fixed cost implied by the evaluation of routing rules, IPv4 has several optimizations when the rules are left unmodified⁵. Those optimizations are removed on the first modification. If we cancel those optimizations, the lookup time for IPv4 is impacted by about 30 ns. This still leaves a 100 ns difference with IPv6 to be explained. Let s compare how time is spent in each lookup function. Here is a CPU flamegraph for IPv4 s `fib_lookup()`: Only 50% of the time is spent in the actual route lookup. The remaining time is spent evaluating the routing rules (about 30 ns). This ratio is dependent on the number of routes we inserted (only 1000 in this example). It should be noted the `fib_table_lookup()` function is executed twice: once with the local routing table and once with the main routing table. The equivalent flamegraph for IPv6 s `ip6_route_output_flags()` is depicted below: Here is an approximate breakdown on the time spent:

50% is spent in the route lookup in the main table,

15% is spent in handling locking (IPv4 is using the more efficient RCU mechanism),

5% is spent in the route lookup of the local table,

most of the remaining is spent in routing rule evaluation (about 100 ns)⁶.

Why does the evaluation of routing rules is less efficient with IPv6? Again, I don t have a definitive answer.

History The following graph shows the performance progression of route lookups through Linux history: All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is able to compile older kernels as well as current ones. The kernel configuration is the default one with `CONFIG_SMP`, `CONFIG_IPV6`, `CONFIG_IPV6_MULTIPLE_TABLES` and `CONFIG_IPV6_SUBTREES` options enabled. Some other unrelated options are enabled to be able to boot them in a virtual machine and run the benchmark. There are three notable performance changes:

In Linux 3.1, Eric Dumazet delays a bit the copy of route metrics to fix the undesirable sharing of route-specific metrics by all cache entries (commit 21efcfa0ff27). Each cache entry now gets its own metrics, which explains the performance hit for the non-/128 scenarios.

In Linux 3.9, Yoshifuji Hideaki removes the reference to the neighbor entry in `struct rt6_info` (commit 887c95cc1da5). This should have lead to a performance increase. The small regression may be due to cache-related issues.

In Linux 4.2, Martin KaFai Lau prevents the creation of cache entries for most route lookups. The most sensible performance improvement comes with commit 4b32b5ad31a6. The second one is from commit 45e4fd26683c, which effectively removes creation of cache entries, except for PMTU exceptions.

Insertion performance Another interesting performance-related metric is the insertion time. Linux is able to insert a full view in less than two seconds. For some reason, the insertion time is not linear above 50,000 routes and climbs very fast to 60 seconds for 500,000 routes. Despite its more complex insertion logic, the IPv4 subsystem is able to insert 2 million routes in less than 10 seconds.

Memory usage Radix tree nodes (struct fib6_node) and routing information (

struct
rt6_info

) are allocated with the slab allocator⁷. It is therefore possible to extract the information from /proc/slabinfo when the kernel is booted with the slab_nomerge flag:

# sed -ne 2p -e '/^ip6_dst/p' -e '/^fib6_nodes/p' /proc/slabinfo   cut -f1 -d:
   name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
fib6_nodes         76101  76104     64   63    1
ip6_dst_cache      40090  40090    384   10    1

In the above example, the used memory is 76104 64+40090 384 bytes (about 20 MiB). The number of struct rt6_info matches the number of routes while the number of nodes is roughly twice the number of routes: Nodes

The memory usage is therefore quite predictable and reasonable, as even a small single-board computer can support several full views (20 MiB for each):

The LPC-trie used for IPv4 is more efficient: when 512 MiB of memory is needed for IPv6 to store 1 million routes, only 128 MiB are needed for IPv4. The difference is mainly due to the size of struct rt6_info (336 bytes) compared to the size of IPv4 s struct fib_alias (48 bytes): IPv4 puts most information about next-hops in struct fib_info structures that are shared with many entries.

Conclusion The takeaways from this article are:

upgrade to Linux 4.2 or more recent to avoid excessive caching,
route lookups are noticeably slower compared to IPv4 (by an order of magnitude),
CONFIG_IPV6_MULTIPLE_TABLES option incurs a fixed penalty of 100 ns by lookup,
memory usage is fair (20 MiB for 40,000 routes).

Compared to IPv4, IPv6 in Linux doesn t foster the same interest, notably in term of optimizations. Hopefully, things are changing as its adoption and use at scale are increasing.

For a given destination prefix, it s possible to attach source-specific prefixes:
```
ip -6 route add 2001:db8:1::/64 \
  from 2001:db8:3::/64 \
  via fe80::1 \
  dev eth0
```
Lookup is first done on the destination address, then on the source address.
This is quite different of the classic scenario where Linux acts as a gateway for a /64 subnet. In this case, the neighbor subsystem stores the reachability information for each host and the routing table only contains a single /64 prefix.
The measurements are done in a virtual machine with one vCPU and no neighbors. The host is an Intel Core i5-4670K running at 3.7 GHz during the experiment (CPU governor set to performance). The benchmark is single-threaded. Many lookups are performed and the result reported is the median value. Timings of individual runs are computed from the TSC.
Most of the packets in the network are expected to be routed to a destination. However, this also means the backtracking code path is not used in the /128 and /48 scenarios. Having a fallback route gives far different results and make it difficult to ensure we explore the address space correctly.
The exact same optimizations could be applied for IPv6. Nobody did it yet.
Compiling out table support effectively removes those last 100 ns.
There is also per-CPU pointers allocated directly (4 bytes per entry per CPU on a 64-bit architecture). We ignore this detail.

2 August 2017

Markus Koschany: PDFsam: How to upgrade a Maven application for Debian

In the coming weeks and months I intend to write a mini series about packaging Java software for Debian. The following article basically starts in the middle of this journey because the PDFsam upgrade is still fresh in my mind. It requires some preexisting knowledge about build tools like Maven and some Java terminology. But do not fear. Hopefully it will make sense in the end when all pieces fall into place. A month ago I decided to upgrade PDFsam, a Java application to split, merge, extract, mix and rotate PDF documents. The current version 1.1.4 is already seven years old and uses Ant as its build system. Unfortunately up to now nobody was interested enough to invest the time to upgrade it to the latest version. A quick internet search unveils that the current sources can be found on github.com. Another brief look reveals we are dealing with a Maven project here because we can find a pom.xml file in the root directory and there is no sign of Ant s typical build.xml file anymore. Here are some general tips how to proceed from this point by using the PDFsam upgrade as an example. Find out how many new dependencies you really need The pom.xml file declares its dependencies in the <dependencies> section. It is good practice to inspect the pom.xml file and determine how much work will be required to upgrade the package. A seasoned Java packager will quickly find common dependencies like Hibernate or the Apache Commons libraries. Fortunately for you they are already packaged in Debian because a lot of projects depend on them. If you are unsure what is and what is not packaged for Debian, tracker.debian.org and codesearch.debian.net are useful tools to search for those packages. If in doubt just ask on debian-java@lists.debian.org. There is no automagical tool (yet) to find out what dependencies are really new (we talk about mh_make soon) but if you use the aforementioned tools and websites you will notice that in June 2017 one could not find the following artifacts: fontawesomefx, eventstudio, sejda-* and jackson-jr-objects. There are also jdepend and testFx but notice they are marked as <scope>test</scope> meaning they are only required if you would like to run upstream s test suite as well. For the sake of simplicity, it is best to ignore them for now and to focus on packaging only dependencies which are really needed to compile the application. Test dependencies can always be added later. This pom.xml investigation leads us to the following conclusion: PDFsam depends on Sejda, a PDF library. Basically Sejda is the product of a major refactoring that happened years ago and allows upstream to develop PDFsam faster and in multiple directions. For Debian packagers it is quite clear now that the upgrade of PDFsam is in reality more like packaging a completely new application. The inspection of Sejda s pom.xml file (another Maven project) reveals we also have to package imgscalr, Twelvemonkeys and SAMBox. We continue with these pom.xml analyses and end up with these new source packages: jackson-jr, libimgscalr-java, libsambox-java, libsejda-java, libsejda-injector-java, libsejda-io-java, libsejda-eventstudio-java, libtwelvemonkeys-java, fontawesomefx and libpdfbox2-java. Later I discovered that gettext-maven-plugin was also required. This was not obvious at first glance if you only check the pom.xml in the root directory but PDFsam and Sejda are multi-module projects! In this case every subdirectory (module) contains another pom.xml with additional information, so ideally you should check those too before you decide to start with your packaging. But don t worry it is often possible to ignore modules with a simple ignore rule inside your debian/*.poms file. The package will have less functionality but it can be still useful if you only need a subset of the modules. Of course in this case ignoring the gettext-maven-plugin artifact would result in a runtime error. C est la vie. A brief remark about Java package names: Java library packages must be named like libXXX-java. This is important for binary packages to avoid naming collisions. We are more tolerant when it comes to source package names but in general we recommend to use the exact same name as for the binary package. There are exceptions like prefixing source packages with their well known project name like jackson-XXX or jboss-XXX but this should only be used when there are already existing packages that use such a naming scheme. If in doubt, talk to us. mh_make or how to quickly generate an initial debian directory Packaging a Maven library is usually not very difficult even if it consists of multiple modules. The tricky part is to get the maven.rules, maven.IgnoreRules and your *.poms file right but debian/rules often only consists of a single dh line and the rest is finding the build-dependencies and adding them to debian/control. A small tool called mh_make, which is included in maven-debian-helper, can lend you a helping hand. The tool is not perfect yet. It requires that most build-dependencies are already installed on your local system, otherwise it won t create the initial debian directory and will only produce some unfinished (but in some cases still useful) files. A rule of thumb is to start with a package that does not depend on any other new dependency and requires the fewest build-dependencies. I have chosen libtwelvemonkeys-java because it was the simplest package and met the aforementioned criteria. Here is how mh_make looks like in action. (The animated GIF was created with Byzanz) First of all download the release tarball, unpack it and run mh_make inside the root directory.

Ok, what is happening here? First you can choose a source and binary package name. Then disable the tests and don t run javadoc to create the documentation. This will simplify things a little. Tests and javadoc settings can be added later. Choose the version you want to package and then you can basically follow the default recommendations and confirm them by hitting the Enter key. Throughout the project we choose to transform the upstream version with the symbolic debian version. Remember that Java/Maven is version-centric. This will ensure that our Maven dependencies are always satisfied later and we can simply upgrade our Maven libraries and don t have to change the versions by hand in various pom.xml files; maven-debian-helper will automatically transform them for us to debian . Enable all modules. If you choose not to, you can select each module individually. Note that later on some of the required build-dependencies cannot be found because they are either not installed (libjmagick6-java) or they cannot be found in Debian s Maven repository under /usr/share/maven-repo. You can fix this by entering a substitution rule or, as I did in this case, you can just ignore these artifacts for now. They will be added to maven.IgnoreRules. In order to successfully compile your program you have to remove them from this file later again, create the correct substitution rule in maven.rules and add the missing build-dependencies to debian/control. For now we just want to quickly create our initial debian directory. If everything went as planned a complete debian directory should be visible in your root directory. The only thing left is to fix the substitution rule for the Servlet API 3.1. Add libservlet3.1-java to Build-Depends and the following rule to maven.rules: javax.servlet s/servlet-api/javax.servlet-api/ * s/.*/3.1/ * *
s/javax.servlet/javax.servlet.jsp/ s/jsp-api/javax.servlet.jsp-api/ * s/.*/2.3/ * * The maven.rules file consists of multiple rows separated by six columns. The values represent groupId, artifactId, type, version number and two fields which I never use.

You can just use an asterisk to match any value. Every value can be substituted. This is necessary when the value of upstream s pom.xml file differs from Debian s system packages. This happens frequently for API packages which are uploaded to Maven Central multiple times under a different groupId/artifactId but provide the same features. In this case the Twelvemonkeys pom requires an older API version but Debian is already at version 3.1. Note that we require a strict version number in this case because libservlet3.1-java does not use a symbolic debian version since we provide more than one Servlet API in the archive and this measure prevents conflicts. Thanks for reading this far. More articles about Java packaging will follow in the near future and hopefully they will clarify some terms and topics which could only be briefly mentioned in this post.

before

and after

3 July 2017

Vincent Bernat: Performance progression of IPv4 route lookup on Linux

TL;DR: Each of Linux 2.6.39, 3.6 and 4.0 brings notable performance improvements for the IPv4 route lookup process.

In a previous article, I explained how Linux implements an IPv4 routing table with compressed tries to offer excellent lookup times. The following graph shows the performance progression of Linux through history: IPv4 route lookup performance

Two scenarios are tested:

500,000 routes extracted from an Internet router (half of them are /24), and
500,000 host routes (/32) tightly packed in 4 distinct subnets.

All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is able to compile older kernels¹ as well as current ones. The kernel configuration used is the default one with CONFIG_SMP and CONFIG_IP_MULTIPLE_TABLES options enabled (however, no IP rules are used). Some other unrelated options are enabled to be able to boot them in a virtual machine and run the benchmark. The measurements are done in a virtual machine with one vCPU². The host is an Intel Core i5-4670K and the CPU governor was set to performance . The benchmark is single-threaded. Implemented as a kernel module, it calls fib_lookup() with various destinations in 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC (and converted to nanoseconds by assuming a constant clock). The following kernel versions bring a notable performance improvement:

In Linux 2.6.39, commit 3630b7c050d9, David Miller removes the hash-based routing table implementation to switch to the LPC-trie implementation (available since Linux 2.6.13 as a compile-time option). This brings a small regression for the scenario with many host routes but boosts the performance for the general case.
In Linux 3.0, commit 281dc5c5ec0f, the improvement is not related to the network subsystem. Linus Torvalds disables the compiler size-optimization from the default configuration. It was believed that optimizing for size would help keeping the instruction cache efficient. However, compilers generated under-performing code on x86 when this option was enabled.
In Linux 3.6, commit f4530fa574df, David Miller adds an optimization to not evaluate IP rules when they are left unconfigured. From this point, the use of the CONFIG_IP_MULTIPLE_TABLES option doesn t impact the performances unless some IP rules are configured. This version also removes the route cache (commit 5e9965c15ba8). However, this has no effect on the benchmark as it directly calls fib_lookup() which doesn t involve the cache.
In Linux 4.0, notably commit 9f9e636d4f89, Alexander Duyck adds several optimizations to the trie lookup algorithm. It really pays off!
In Linux 4.1, commit 0ddcf43d5d4a, Alexander Duyck collapses the local and main tables when no specific IP rules are configured. For non-local traffic, both those tables were looked up.

Compiling old kernels with an updated userland may still require some small patches.
The kernels are compiled with the CONFIG_SMP option to use the hierarchical RCU and activate more of the same code paths as actual routers. However, progress on parallelism are left unnoticed.

21 June 2017

Vincent Bernat: IPv4 route lookup on Linux

TL;DR: With its implementation of IPv4 routing tables using LPC-tries, Linux offers good lookup performance (50 ns for a full view) and low memory usage (64 MiB for a full view).

During the lifetime of an IPv4 datagram inside the Linux kernel, one important step is the route lookup for the destination address through the fib_lookup() function. From essential information about the datagram (source and destination IP addresses, interfaces, firewall mark, ), this function should quickly provide a decision. Some possible options are:

local delivery (RTN_LOCAL),
forwarding to a supplied next hop (RTN_UNICAST),
silent discard (RTN_BLACKHOLE).

Since 2.6.39, Linux stores routes into a compressed prefix tree (commit 3630b7c050d9). In the past, a route cache was maintained but it has been removed¹ in Linux 3.6.

Route lookup in a trie Looking up a route in a routing table is to find the most specific prefix matching the requested destination. Let s assume the following routing table:

$ ip route show scope global table 100
default via 203.0.113.5 dev out2
192.0.2.0/25
        nexthop via 203.0.113.7  dev out3 weight 1
        nexthop via 203.0.113.9  dev out4 weight 1
192.0.2.47 via 203.0.113.3 dev out1
192.0.2.48 via 203.0.113.3 dev out1
192.0.2.49 via 203.0.113.3 dev out1
192.0.2.50 via 203.0.113.3 dev out1

Here are some examples of lookups and the associated results:

Destination IP	Next hop
`192.0.2.49`	`203.0.113.3` via `out1`
`192.0.2.50`	`203.0.113.3` via `out1`
`192.0.2.51`	`203.0.113.7` via `out3` or `203.0.113.9` via `out4` (ECMP)
`192.0.2.200`	`203.0.113.5` via `out2`

A common structure for route lookup is the trie, a tree structure where each node has its parent as prefix.

Lookup with a simple trie The following trie encodes the previous routing table: For each node, the prefix is known by its path from the root node and the prefix length is the current depth. A lookup in such a trie is quite simple: at each step, fetch the n^th bit of the IP address, where n is the current depth. If it is 0, continue with the first child. Otherwise, continue with the second. If a child is missing, backtrack until a routing entry is found. For example, when looking for `192.0.2.50`, we will find the result in the corresponding leaf (at depth 32). However for `192.0.2.51`, we will reach `192.0.2.50/31` but there is no second child. Therefore, we backtrack until the `192.0.2.0/25` routing entry. Adding and removing routes is quite easy. From a performance point of view, the lookup is done in constant time relative to the number of routes (due to maximum depth being capped to 32). Quagga is an example of routing software still using this simple approach.

Lookup with a path-compressed trie In the previous example, most nodes only have one child. This leads to a lot of unneeded bitwise comparisons and memory is also wasted on many nodes. To overcome this problem, we can use path compression: each node with only one child is removed (except if it also contains a routing entry). Each remaining node gets a new property telling how many input bits should be skipped. Such a trie is also known as a Patricia trie or a radix tree. Here is the path-compressed version of the previous trie: Since some bits have been ignored, on a match, a final check is executed to ensure all bits from the found entry are matching the input IP address. If not, we must act as if the entry wasn t found (and backtrack to find a matching prefix). The following figure shows two IP addresses matching the same leaf: The reduction on the average depth of the tree compensates the necessity to handle those false positives. The insertion and deletion of a routing entry is still easy enough. Many routing systems are using Patricia trees:

OpenBSD

NetBSD

FreeBSD

GoBGP (through go-radix)

Linux for IPv6

Lookup with a level-compressed trie In addition to path compression, level compression² detects parts of the trie that are densily populated and replace them with a single node and an associated vector of 2^k children. This node will handle k input bits instead of just one. For example, here is a level-compressed version our previous trie: Such a trie is called LC-trie or LPC-trie and offers higher lookup performances compared to a radix tree. An heuristic is used to decide how many bits a node should handle. On Linux, if the ratio of non-empty children to all children would be above 50% when the node handles an additional bit, the node gets this additional bit. On the other hand, if the current ratio is below 25%, the node loses the responsibility of one bit. Those values are not tunable. Insertion and deletion becomes more complex but lookup times are also improved.

Implementation in Linux The implementation for IPv4 in Linux exists since 2.6.13 (commit 19baf839ff4a) and is enabled by default since 2.6.39 (commit 3630b7c050d9). Here is the representation of our example routing table in memory³: Memory representation of a trie

There are several structures involved:

struct fib_table represents a routing table,
struct trie represents a complete trie,
struct key_vector represents either an internal node (when bits is not zero) or a leaf,
struct fib_info contains the characteristics shared by several routes (like a next-hop gateway and an output interface),
struct fib_alias is the glue between the leaves and the fib_info structures.

The trie can be retrieved through /proc/net/fib_trie:

$ cat /proc/net/fib_trie
Id 100:
  +-- 0.0.0.0/0 2 0 2
      -- 0.0.0.0
        /0 universe UNICAST
     +-- 192.0.2.0/26 2 0 1
         -- 192.0.2.0
           /25 universe UNICAST
         -- 192.0.2.47
           /32 universe UNICAST
        +-- 192.0.2.48/30 2 0 1
            -- 192.0.2.48
              /32 universe UNICAST
            -- 192.0.2.49
              /32 universe UNICAST
            -- 192.0.2.50
              /32 universe UNICAST
[...]

For internal nodes, the numbers after the prefix are:

the number of bits handled by the node,
the number of full children (they only handle one bit),
the number of empty children.

Moreover, if the kernel was compiled with CONFIG_IP_FIB_TRIE_STATS, some interesting statistics are available in /proc/net/fib_triestat⁴:

$ cat /proc/net/fib_triestat
Basic info: size of leaf: 48 bytes, size of tnode: 40 bytes.
Id 100:
        Aver depth:     2.33
        Max depth:      3
        Leaves:         6
        Prefixes:       6
        Internal nodes: 3
          2: 3
        Pointers: 12
Null ptrs: 4
Total size: 1  kB
[...]

When a routing table is very dense, a node can handle many bits. For example, a densily populated routing table with 1 million entries packed in a /12 can have one internal node handling 20 bits. In this case, route lookup is essentially reduced to a lookup in a vector. The following graph shows the number of internal nodes used relative to the number of routes for different scenarios (routes extracted from an Internet full view, /32 routes spreaded over 4 different subnets with various densities). When routes are densily packed, the number of internal nodes are quite limited. Internal nodes and null pointers

Performance So how performant is a route lookup? The maximum depth stays low (about 6 for a full view), so a lookup should be quite fast. With the help of a small kernel module, we can accurately benchmark⁵ the `fib_lookup()` function: The lookup time is loosely tied to the maximum depth. When the routing table is densily populated, the maximum depth is low and the lookup times are fast. When forwarding at 10 Gbps, the time budget for a packet would be about 50 ns. Since this is also the time needed for the route lookup alone in some cases, we wouldn t be able to forward at line rate with only one core. Nonetheless, the results are pretty good and they are expected to scale linearly with the number of cores. The measurements are done with a Linux kernel 4.11 from Debian unstable. I have gathered performance metrics accross kernel versions in Performance progression of IPv4 route lookup on Linux . Another interesting figure is the time it takes to insert all those routes into the kernel. Linux is also quite efficient in this area since you can insert 2 million routes in less than 10 seconds:

Memory usage The memory usage is available directly in `/proc/net/fib_triestat`. The statistic provided doesn t account for the `fib_info` structures, but you should only have a handful of them (one for each possible next-hop). As you can see on the graph below, the memory use is linear with the number of routes inserted, whatever the shape of the routes is. The results are quite good. With only 256 MiB, about 2 million routes can be stored!

Routing rules Unless configured without CONFIG_IP_MULTIPLE_TABLES, Linux supports several routing tables and has a system of configurable rules to select the table to use. These rules can be configured with ip rule. By default, there are three of them:

$ ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

Linux will first lookup for a match in the local table. If it doesn t find one, it will lookup in the main table and at last resort, the default table.

Builtin tables The local table contains routes for local delivery:

$ ip route show table local
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 192.168.117.0 dev eno1 proto kernel scope link src 192.168.117.55
local 192.168.117.55 dev eno1 proto kernel scope host src 192.168.117.55
broadcast 192.168.117.63 dev eno1 proto kernel scope link src 192.168.117.55

This table is populated automatically by the kernel when addresses are configured. Let s look at the three last lines. When the IP address 192.168.117.55 was configured on the eno1 interface, the kernel automatically added the appropriate routes:

a route for 192.168.117.55 for local unicast delivery to the IP address,
a route for 192.168.117.255 for broadcast delivery to the broadcast address,
a route for 192.168.117.0 for broadcast delivery to the network address.

When 127.0.0.1 was configured on the loopback interface, the same kind of routes were added to the local table. However, a loopback address receives a special treatment and the kernel also adds the whole subnet to the local table. As a result, you can ping any IP in 127.0.0.0/8:

$ ping -c1 127.42.42.42
PING 127.42.42.42 (127.42.42.42) 56(84) bytes of data.
64 bytes from 127.42.42.42: icmp_seq=1 ttl=64 time=0.039 ms
--- 127.42.42.42 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms

The main table usually contains all the other routes:

$ ip route show table main
default via 192.168.117.1 dev eno1 proto static metric 100
192.168.117.0/26 dev eno1 proto kernel scope link src 192.168.117.55 metric 100

The default route has been configured by some DHCP daemon. The connected route (scope link) has been automatically added by the kernel (proto kernel) when configuring an IP address on the eno1 interface. The default table is empty and has little use. It has been kept when the current incarnation of advanced routing has been introduced in Linux 2.1.68 after a first tentative using classes in Linux 2.1.15⁶.

Performance Since Linux 4.1 (commit 0ddcf43d5d4a), when the set of rules is left unmodified, the main and local tables are merged and the lookup is done with this single table (and the default table if not empty). Moreover, since Linux 3.0 (commit f4530fa574df), without specific rules, there is no performance hit when enabling the support for multiple routing tables. However, as soon as you add new rules, some CPU cycles will be spent for each datagram to evaluate them. Here is a couple of graphs demonstrating the impact of routing rules on lookup times: Routing rules impact on performance

For some reason, the relation is linear when the number of rules is between 1 and 100 but the slope increases noticeably past this threshold. The second graph highlights the negative impact of the first rule (about 30 ns). A common use of rules is to create virtual routers: interfaces are segregated into domains and when a datagram enters through an interface from domain A, it should use routing table A:

# ip rule add iif vlan457 table 10
# ip rule add iif vlan457 blackhole
# ip rule add iif vlan458 table 20
# ip rule add iif vlan458 blackhole

The blackhole rules may be removed if you are sure there is a default route in each routing table. For example, we add a blackhole default with a high metric to not override a regular default route:

# ip route add blackhole default metric 9999 table 10
# ip route add blackhole default metric 9999 table 20
# ip rule add iif vlan457 table 10
# ip rule add iif vlan458 table 20

To reduce the impact on performance when many interface-specific rules are used, interfaces can be attached to VRF instances and a single rule can be used to select the appropriate table:

# ip link add vrf-A type vrf table 10
# ip link set dev vrf-A up
# ip link add vrf-B type vrf table 20
# ip link set dev vrf-B up
# ip link set dev vlan457 master vrf-A
# ip link set dev vlan458 master vrf-B
# ip rule show
0:      from all lookup local
1000:   from all lookup [l3mdev-table]
32766:  from all lookup main
32767:  from all lookup default

The special l3mdev-table rule was automatically added when configuring the first VRF interface. This rule will select the routing table associated to the VRF owning the input (or output) interface. VRF was introduced in Linux 4.3 (commit 193125dbd8eb), the performance was greatly enhanced in Linux 4.8 (commit 7889681f4a6c) and the special routing rule was also introduced in Linux 4.8 (commit 96c63fa7393d, commit 1aa6c4f6b8cd). You can find more details about it in the kernel documentation.

Conclusion The takeaways from this article are:

route lookup times hardly increase with the number of routes,

densily packed /32 routes lead to amazingly fast route lookups,

memory use is low (128 MiB par million routes),

no optimization is done on routing rules.

The routing cache was subject to reasonably easy to launch denial of service attacks. It was also believed to not be efficient for high volume sites like Google but I have first-hand experience it was not the case for moderately high volume sites.

IP-address lookup using LC-tries , IEEE Journal on Selected Areas in Communications, 17(6):1083-1092, June 1999.

For internal nodes, the `key_vector` structure is embedded into a `tnode` structure. This structure contains information rarely used during lookup, notably the reference to the parent that is usually not needed for backtracking as Linux keeps the nearest candidate in a variable.

One leaf can contain several routes (`struct fib_alias` is a list). The number of prefixes can therefore be greater than the number of leaves. The system also keeps statistics about the distribution of the internal nodes relative to the number of bits they handle. In our example, all the three internal nodes are handling 2 bits.

The measurements are done in a virtual machine with one vCPU. The host is an Intel Core i5-4670K running at 3.7 GHz during the experiment (CPU governor was set to performance). The benchmark is single-threaded. It runs a warm-up phase, then executes about 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC.

Fun fact: the documentation of this first tentative of more flexible routing is still available in today s kernel tree and explains the usage of the default class .

3 May 2017

Vincent Bernat: VXLAN: BGP EVPN with Cumulus Quagga

VXLAN is an overlay network to encapsulate Ethernet traffic over an existing (highly available and scalable, possibly the Internet) IP network while accomodating a very large number of tenants. It is defined in RFC 7348. For an uncut introduction on its use with Linux, have a look at my VXLAN & Linux post. VXLAN deployment

In the above example, we have hypervisors hosting a virtual machines from different tenants. Each virtual machine is given access to a tenant-specific virtual Ethernet segment. Users are expecting classic Ethernet segments: no MAC restrictions¹, total control over the IP addressing scheme they use and availability of multicast.

In a large VXLAN deployment, two aspects need attention:

discovery of other endpoints (VTEPs) sharing the same VXLAN segments, and
avoidance of BUM frames (broadcast, unknown unicast and multicast) as they have to be forwarded to all VTEPs.

A typical solution for the first point is using multicast. For the second point, this is source-address learning.

Introduction to BGP EVPN BGP EVPN (RFC 7432 and draft-ietf-bess-evpn-overlay for its application to VXLAN) is a standard control protocol to efficiently solves those two aspects without relying on multicast nor source-address learning. BGP EVPN relies on BGP (RFC 4271) and its MP-BGP extensions (RFC 4760). BGP is the routing protocol powering the Internet. It is highly scalable and interoperable. It is also extensible and one of its extension is MP-BGP. This extension can carry reachability information (NLRI) for multiple protocols (IPv4, IPv6, L3VPN and in our case EVPN). EVPN is a special family to advertise MAC addresses and the remote equipments they are attached to. There are basically two kinds of reachability information a VTEP sends through BGP EVPN:

the VNIs they have interest in (type 3 routes), and

for each VNI, the local MAC addresses (type 2 routes).

The protocol also covers other aspects of virtual Ethernet segments (L3 reachability information from ARP/ND caches, MAC mobility and multi-homing²) but we won t describe them here. To deploy BGP EVPN, a typical solution is to use several route reflectors (both for redundancy and scalability), like in the picture below. Each VTEP opens a BGP session to at least two route reflectors, sends its information (MACs and VNIs) and receives others . This reduces the number of BGP sessions to configure. Compared to other solutions to deploy VXLAN, BGP EVPN has three main advantages:

interoperability with other vendors (notably Juniper and Cisco),

proven scalability (a typical BGP routers handle several millions of routes), and

possibility to enforce fine-grained policies.

On Linux, Cumulus Quagga is a fairly complete implementation of BGP EVPN (type 3 routes for VTEP discovery, type 2 routes with MAC or IP addresses, MAC mobility when a host changes from one VTEP to another one) which requires very little configuration. This is a fork of Quagga and currently used in Cumulus Linux, a network operating system based on Debian powering switches from various brands. At some point, BGP EVPN support will be contributed back to FRR, a community-maintained fork of Quagga³. It should be noted the BGP EVPN implementation of Cumulus Quagga currently only supports IPv4.

Route reflector setup Before configuring each VTEP, we need to configure two or more route reflectors. There are many solutions. I will present three of them:

using Cumulus Quagga,

using GoBGP, an implementation of BGP in Go,

using Juniper JunOS.

For reliability purpose, it s possible (and easy) to use one implementation for some route reflectors and another implementation for the other ones. The proposed configurations are quite minimal. However, it is possible to centralize policies on the route reflectors (e.g. routes tagged with some community can only be readvertised to some group of VTEPs).

Using Quagga The configuration is pretty simple. We suppose the configured route reflector has 203.0.113.254 configured as a loopback IP.

router bgp 65000
  bgp router-id 203.0.113.254
  bgp cluster-id 203.0.113.254
  bgp log-neighbor-changes
  no bgp default ipv4-unicast
  neighbor fabric peer-group
  neighbor fabric remote-as 65000
  neighbor fabric capability extended-nexthop
  neighbor fabric update-source 203.0.113.254
  bgp listen range 203.0.113.0/24 peer-group fabric
  !
  address-family evpn
   neighbor fabric activate
   neighbor fabric route-reflector-client
  exit-address-family
  !
  exit
!

A peer group fabric is defined and we leverage the dynamic neighbor feature of Cumulus Quagga: we don t have to explicitely define each neighbor. Any client from 203.0.113.0/24 and presenting itself as part of AS 65000 can connect. All sent EVPN routes will be accepted and reflected to the other clients. You don t need to run Zebra, the route engine talking with the kernel. Instead, start bgpd with the --no_kernel flag.

Using GoBGP GoBGP is a clean implementation of BGP in Go⁴. It exposes an RPC API for configuration (but accepts a configuration file and comes with a command-line client). It doesn t support dynamic neighbors, so you ll have to use the API, the command-line client or some templating language to automate their declaration. A configuration with only one neighbor is like this:

global:
  config:
    as: 65000
    router-id: 203.0.113.254
    local-address-list:
      - 203.0.113.254
neighbors:
  - config:
      neighbor-address: 203.0.113.1
      peer-as: 65000
    afi-safis:
      - config:
          afi-safi-name: l2vpn-evpn
    route-reflector:
      config:
        route-reflector-client: true
        route-reflector-cluster-id: 203.0.113.254

More neighbors can be added from the command line:

$ gobgp neighbor add 203.0.113.2 as 65000 \
>         route-reflector-client 203.0.113.254 \
>         --address-family evpn

GoBGP won t try to interact with the kernel which is fine as a route reflector.

Using Juniper JunOS A variety of Juniper products can be a BGP route reflector, notably:

the Juniper QFX5100⁵, a top-of-the-rack switch,
the Juniper MX240⁶, a high-range router,
the Juniper vRR⁷, a virtual appliance running JunOS without a data-plane.

The main factor is the CPU and the memory. The QFX5100 is low on memory and won t support large deployments without some additional policing. Here is a configuration similar to the Quagga one:

interfaces  
    lo0  
        unit 0  
            family inet  
                address 203.0.113.254/32;
             
         
     
 
protocols  
    bgp  
        group fabric  
            family evpn  
                signaling  
                    /* Do not try to install EVPN routes */
                    no-install;
                 
             
            type internal;
            cluster 203.0.113.254;
            local-address 203.0.113.254;
            allow 203.0.113.0/24;
         
     
 
routing-options  
    router-id 203.0.113.254;
    autonomous-system 65000;

VTEP setup The next step is to configure each VTEP/hypervisor. Each VXLAN is locally configured using a bridge for local virtual interfaces, like illustrated in the below schema. The bridge is taking care of the local MAC addresses (notably, using source-address learning) and the VXLAN interface takes care of the remote MAC addresses (received with BGP EVPN). Bridged VXLAN device

VXLANs can be provisioned with the following script. Source-address learning is disabled as we will rely solely on BGP EVPN to synchronize FDBs between the hypervisors.

for vni in 100 200; do
    # Create VXLAN interface
    ip link add vxlan$ vni  type vxlan
        id $ vni  \
        dstport 4789 \
        local 203.0.113.2 \
        nolearning
    # Create companion bridge
    brctl addbr br$ vni 
    brctl addif br$ vni  vxlan$ vni 
    brctl stp br$ vni  off
    ip link set up dev br$ vni 
    ip link set up dev vxlan$ vni 
done
# Attach each VM to the appropriate segment
brctl addif br100 vnet10
brctl addif br100 vnet11
brctl addif br200 vnet12

The configuration of Cumulus Quagga is similar to the one used for a route reflector, except we use the advertise-all-vni directive to publish all local VNIs.

router bgp 65000
  bgp router-id 203.0.113.2
  no bgp default ipv4-unicast
  neighbor fabric peer-group
  neighbor fabric remote-as 65000
  neighbor fabric capability extended-nexthop
  neighbor fabric update-source dummy0
  ! BGP sessions with route reflectors
  neighbor 203.0.113.253 peer-group fabric
  neighbor 203.0.113.254 peer-group fabric
  !
  address-family evpn
   neighbor fabric activate
   advertise-all-vni
  exit-address-family
  !
  exit
!

If everything works as expected, the instances sharing the same VNI should be able to ping each other. If IPv6 is enabled on the VMs, the ping command shows if everything is in order:

$ ping -c10 -w1 -t1 ff02::1%eth0
PING ff02::1%eth0(ff02::1%eth0) 56 data bytes
64 bytes from fe80::5254:33ff:fe00:8%eth0: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from fe80::5254:33ff:fe00:b%eth0: icmp_seq=1 ttl=64 time=4.98 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:9%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:a%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
--- ff02::1%eth0 ping statistics ---
1 packets transmitted, 1 received, +3 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.016/3.745/4.991/2.152 ms

Verification Step by step, let s check how everything comes together.

Getting VXLAN information from the kernel On each VTEP, Quagga should be able to retrieve the information about configured VXLANs. This can be checked with vtysh:

# show interface vxlan100
Interface vxlan100 is up, line protocol is up
  Link ups:       1    last: 2017/04/29 20:01:33.43
  Link downs:     0    last: (never)
  PTM status: disabled
  vrf: Default-IP-Routing-Table
  index 11 metric 0 mtu 1500
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: 62:42:7a:86:44:01
  inet6 fe80::6042:7aff:fe86:4401/64
  Interface Type Vxlan
  VxLAN Id 100
  Access VLAN Id 1
  Master (bridge) ifindex 9 ifp 0x56536e3f3470

The important points are:

the VNI is 100, and
the bridge device was correctly detected.

Quagga should also be able to retrieve information about the local MAC addresses :

# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 2
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:0a local  eth1.100
50:54:33:00:00:0b local  eth2.100

BGP sessions Each VTEP has to establish a BGP session to the route reflectors. On the VTEP, this can be checked by running vtysh:

# show bgp neighbors 203.0.113.254
BGP neighbor is 203.0.113.254, remote AS 65000, local AS 65000, internal link
 Member of peer-group fabric for session parameters
  BGP version 4, remote router ID 203.0.113.254
  BGP state = Established, up for 00:00:45
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      L2VPN EVPN: RX advertised L2VPN EVPN
    Route refresh: advertised and received(new)
    Address family L2VPN EVPN: advertised and received
    Hostname Capability: advertised
    Graceful Restart Capabilty: advertised
[...]
 For address family: L2VPN EVPN
  fabric peer-group member
  Update group 1, subgroup 1
  Packet Queue length 0
  Community attribute sent to this neighbor(both)
  8 accepted prefixes

  Connections established 1; dropped 0
  Last reset never
Local host: 203.0.113.2, Local port: 37603
Foreign host: 203.0.113.254, Foreign port: 179

The output includes the following information:

the BGP state is Established,
the address family L2VPN EVPN is correctly advertised, and
8 routes are received from this route reflector.

The state of the BGP sessions can also be checked from the route reflectors. With GoBGP, use the following command:

# gobgp neighbor 203.0.113.2
BGP neighbor is 203.0.113.2, remote AS 65000, route-reflector-client
  BGP version 4, remote router ID 203.0.113.2
  BGP state = established, up for 00:04:30
  BGP OutQ = 0, Flops = 0
  Hold time is 9, keepalive interval is 3 seconds
  Configured hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    multiprotocol:
        l2vpn-evpn:     advertised and received
    route-refresh:      advertised and received
    graceful-restart:   received
    4-octet-as: advertised and received
    add-path:   received
    UnknownCapability(73):      received
    cisco-route-refresh:        received
[...]
  Route statistics:
    Advertised:             8
    Received:               5
    Accepted:               5

With JunOS, use the below command:

> show bgp neighbor 203.0.113.2
Peer: 203.0.113.2+38089 AS 65000 Local: 203.0.113.254+179 AS 65000
  Group: fabric                Routing-Instance: master
  Forwarding routing-instance: master
  Type: Internal    State: Established
  Last State: OpenConfirm   Last Event: RecvKeepAlive
  Last Error: None
  Options: <Preference LocalAddress Cluster AddressFamily Rib-group Refresh>
  Address families configured: evpn
  Local Address: 203.0.113.254 Holdtime: 90 Preference: 170
  NLRI evpn: NoInstallForwarding
  Number of flaps: 0
  Peer ID: 203.0.113.2     Local ID: 203.0.113.254     Active Holdtime: 9
  Keepalive Interval: 3          Group index: 0    Peer index: 2
  I/O Session Thread: bgpio-0 State: Enabled
  BFD: disabled, down
  NLRI for restart configured on peer: evpn
  NLRI advertised by peer: evpn
  NLRI for this session: evpn
  Peer supports Refresh capability (2)
  Stale routes from peer are kept for: 300
  Peer does not support Restarter functionality
  NLRI that restart is negotiated for: evpn
  NLRI of received end-of-rib markers: evpn
  NLRI of all end-of-rib markers sent: evpn
  Peer does not support LLGR Restarter or Receiver functionality
  Peer supports 4 byte AS extension (peer-as 65000)
  NLRI's for which peer can receive multiple paths: evpn
  Table bgp.evpn.0 Bit: 20000
    RIB State: BGP restart is complete
    RIB State: VPN restart is complete
    Send state: in sync
    Active prefixes:              5
    Received prefixes:            5
    Accepted prefixes:            5
    Suppressed due to damping:    0
    Advertised prefixes:          8
  Last traffic (seconds): Received 276  Sent 170  Checked 276
  Input messages:  Total 61     Updates 3       Refreshes 0     Octets 1470
  Output messages: Total 62     Updates 4       Refreshes 0     Octets 1775
  Output Queue[1]: 0            (bgp.evpn.0, evpn)

If a BGP session cannot be established, the logs of each BGP daemon should mention the cause.

Sent routes From each VTEP, Quagga needs to send:

one type 3 route for each local VNI, and
one type 2 route for each local MAC address.

The best place to check the received routes is on one of the route reflectors. If you are using JunOS, the following command will display the received routes from the provided VTEP:

> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2
bgp.evpn.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
  2:203.0.113.2:100::0::50:54:33:00:00:0a/304 MAC/IP
*                         203.0.113.2                  100        I
  2:203.0.113.2:100::0::50:54:33:00:00:0b/304 MAC/IP
*                         203.0.113.2                  100        I
  3:203.0.113.2:100::0::203.0.113.2/304 IM
*                         203.0.113.2                  100        I
  3:203.0.113.2:200::0::203.0.113.2/304 IM
*                         203.0.113.2                  100        I

There is one type 3 route for VNI 100 and another one for VNI 200. There are also two type 2 routes for two MAC addresses on VNI 100. To get more information, you can add the keyword extensive. Here is a type 3 route advertising 203.0.113.2 as a VTEP for VNI 100⁸:

> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2 extensive
bgp.evpn.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
* 3:203.0.113.2:100::0::203.0.113.2/304 IM (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.2:100
     Nexthop: 203.0.113.2
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
[...]

Here is a type 2 route announcing the location of the 50:54:33:00:00:0a MAC address for VNI 100:

> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2 extensive
bgp.evpn.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
* 2:203.0.113.2:100::0::50:54:33:00:00:0a/304 MAC/IP (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.2:100
     Route Label: 100
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: 203.0.113.2
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
[...]

With Quagga, you can get a similar output with vtysh:

# show bgp evpn route
BGP table version is 0, local router ID is 203.0.113.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 203.0.113.2:100
*>i[2]:[0]:[0]:[48]:[50:54:33:00:00:0a]
                    203.0.113.2                   100      0 i
*>i[2]:[0]:[0]:[48]:[50:54:33:00:00:0b]
                    203.0.113.2                   100      0 i
*>i[3]:[0]:[32]:[203.0.113.2]
                    203.0.113.2                   100      0 i
Route Distinguisher: 203.0.113.2:200
*>i[3]:[0]:[32]:[203.0.113.2]
                    203.0.113.2                   100      0 i
[...]

With GoBGP, use the following command:

# gobgp global rib -a evpn   grep rd:203.0.113.2:200
    Network  Next Hop             AS_PATH              Age        Attrs
*>  [type:macadv][rd:203.0.113.2:100][esi:single-homed][etag:0][mac:50:54:33:00:00:0a][ip:<nil>][labels:[100]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:macadv][rd:203.0.113.2:100][esi:single-homed][etag:0][mac:50:54:33:00:00:0b][ip:<nil>][labels:[100]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:macadv][rd:203.0.113.2:200][esi:single-homed][etag:0][mac:50:54:33:00:00:0a][ip:<nil>][labels:[200]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435656] ]
*>  [type:multicast][rd:203.0.113.2:100][etag:0][ip:203.0.113.2]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:multicast][rd:203.0.113.2:200][etag:0][ip:203.0.113.2]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435656] ]

Received routes Each VTEP should have received the type 2 and type 3 routes from its fellow VTEPs, through the route reflectors. You can check with the show bgp evpn route command of vtysh. Does Quagga correctly understand the received routes? The type 3 routes are translated to an assocation between the remote VTEPs and the VNIs:

# show evpn vni
Number of VNIs: 2
VNI        VxLAN IF              VTEP IP         # MACs   # ARPs   Remote VTEPs
100        vxlan100              203.0.113.2     4        0        203.0.113.3
                                                                   203.0.113.1
200        vxlan200              203.0.113.2     3        0        203.0.113.3
                                                                   203.0.113.1

The type 2 routes are translated to an association between the remote MACs and the remote VTEPs:

# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 4
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:09 remote 203.0.113.1
50:54:33:00:00:0a local  eth1.100
50:54:33:00:00:0b local  eth2.100
50:54:33:00:00:0c remote 203.0.113.3

FDB configuration The last step is to ensure Quagga has correctly provided the received information to the kernel. This can be checked with the bridge command:

# bridge fdb show dev vxlan100   grep dst
00:00:00:00:00:00 dst 203.0.113.1 self permanent
00:00:00:00:00:00 dst 203.0.113.3 self permanent
50:54:33:00:00:0c dst 203.0.113.3 self
50:54:33:00:00:09 dst 203.0.113.1 self

All good! The two first lines are the translation of the type 3 routes (any BUM frame will be sent to both 203.0.113.1 and 203.0.113.3) and the two last ones are the translation of the type 2 routes.

Interoperability One of the strength of BGP EVPN is the interoperability with other network vendors. To demonstrate it works as expected, we will configure a Juniper vMX to act as a VTEP. First, we need to configure the physical bridge⁹. This is similar to the use of ip link and brctl with Linux. We only configure one physical interface with two old-school VLANs paired with matching VNIs.

interfaces  
    ge-0/0/1  
        unit 0  
            family bridge  
                interface-mode trunk;
                vlan-id-list [ 100 200 ];
             
         
     
 
routing-instances  
    switch  
        instance-type virtual-switch;
        interface ge-0/0/1.0;
        bridge-domains  
            vlan100  
                domain-type bridge;
                vlan-id 100;
                vxlan  
                    vni 100;
                    ingress-node-replication;
                 
             
            vlan200  
                domain-type bridge;
                vlan-id 200;
                vxlan  
                    vni 200;
                    ingress-node-replication;

Then, we configure BGP EVPN to advertise all known VNIs. The configuration is quite similar to the one we did with Quagga:

protocols  
    bgp  
        group fabric  
            type internal;
            multihop;
            family evpn signaling;
            local-address 203.0.113.3;
            neighbor 203.0.113.253;
            neighbor 203.0.113.254;
         
     
 
routing-instances  
    switch  
        vtep-source-interface lo0.0;
        route-distinguisher 203.0.113.3:1; #  
        vrf-import EVPN-VRF-VXLAN;
        vrf-target  
            target:65000:1;
            auto;
         
        protocols  
            evpn  
                encapsulation vxlan;
                extended-vni-list all;
                multicast-mode ingress-replication;
             
         
     
 
routing-options  
    router-id 203.0.113.3;
    autonomous-system 65000;
 
policy-options  
    policy-statement EVPN-VRF-VXLAN  
        then accept;

We also need a small compatibility patch for Cumulus Quagga¹⁰. The routes sent by this configuration are very similar to the routes sent by Quagga. The main differences are:

on JunOS, the route distinguisher is configured statically (in ), and
on JunOS, the VNI is also encoded as an Ethernet tag ID.

Here is a type 3 route, as sent by JunOS:

> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.3 extensive
bgp.evpn.0: 13 destinations, 13 routes (13 active, 0 holddown, 0 hidden)
* 3:203.0.113.3:1::100::203.0.113.3/304 IM (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.3:1
     Nexthop: 203.0.113.3
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
     PMSI: Flags 0x0: Label 6: Type INGRESS-REPLICATION 203.0.113.3
[...]

Here is a type 2 route:

> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.3 extensive
bgp.evpn.0: 13 destinations, 13 routes (13 active, 0 holddown, 0 hidden)
* 2:203.0.113.3:1::200::50:54:33:00:00:0f/304 MAC/IP (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.3:1
     Route Label: 200
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: 203.0.113.3
     Localpref: 100
     AS path: I
     Communities: target:65000:268435656 encapsulation:vxlan(0x8)
[...]

We can check that the vMX is able to make sense of the routes it receives from its peers running Quagga:

> show evpn database l2-domain-id 100
Instance: switch
VLAN  DomainId  MAC address        Active source                  Timestamp        IP address
     100        50:54:33:00:00:0c  203.0.113.1                    Apr 30 12:46:20
     100        50:54:33:00:00:0d  203.0.113.2                    Apr 30 12:32:42
     100        50:54:33:00:00:0e  203.0.113.2                    Apr 30 12:46:20
     100        50:54:33:00:00:0f  ge-0/0/1.0                     Apr 30 12:45:55

On the other end, if we look at one of the Quagga-based VTEP, we can check the received routes are correctly understood:

# show evpn vni 100
VNI: 100
 VxLAN interface: vxlan100 ifIndex: 9 VTEP IP: 203.0.113.1
 Remote VTEPs for this VNI:
  203.0.113.3
  203.0.113.2
 Number of MACs (local and remote) known for this VNI: 4
 Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 0
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 4
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:0c local  eth1.100
50:54:33:00:00:0d remote 203.0.113.2
50:54:33:00:00:0e remote 203.0.113.2
50:54:33:00:00:0f remote 203.0.113.3

Get in touch if you have some success with other vendors!

For example, they may use bridges to connect containers together.
Such a feature can replace proprietary implementations of MC-LAG allowing several VTEPs to act as a endpoint for a single link aggregation group. This is not needed on our scenario where hypervisors act as VTEPs.
The development of Quagga is slow and closed . New features are often stalled. FRR is placed under the umbrella of the Linux Foundation, has a GitHub-centered development model and an election process. It already has several interesting enhancements (notably, BGP add-path, BGP unnumbered, MPLS and LDP).
I am unenthusiastic about projects whose the sole purpose is to rewrite something in Go. However, while being quite young, GoBGP is quite valuable on its own (good architecture, good performance).
The 48-port version is around $10,000 with the BGP license.
An empty chassis with a dual routing engine (RE-S-1800X4-16G) is around $30,000.
I don t know how pricey the vRR is. For evaluation purposes, it can be downloaded for free if you are a customer.
The value 100 used in the route distinguishier (203.0.113.2:100) is not the one used to encode the VNI. The VNI is encoded in the route target (65000:268435556), in the 24 least signifiant bits (268435556 & 0xffffff equals 100). As long as VNIs are unique, we don t have to understand those details.
For some reason, the use of a virtual switch is mandatory. This is specific to this platform: a QFX doesn t require this.
The encoding of the VNI into the route target is being standardized in draft-ietf-bess-evpn-overlay. Juniper already implements this draft.

Vincent Bernat: VXLAN & Linux

VXLAN is an overlay network to carry Ethernet traffic over an existing (highly available and scalable) IP network while accommodating a very large number of tenants. It is defined in RFC 7348. Starting from Linux 3.12, the VXLAN implementation is quite complete as both multicast and unicast are supported as well as IPv6 and IPv4. Let s explore the various methods to configure it. VXLAN setup

To illustrate our examples, we use the following setup:

an underlay IP network (highly available and scalable, possibly the Internet),
three Linux bridges acting as VXLAN tunnel endpoints (VTEP),
four servers believing they share a common Ethernet segment.

A VXLAN tunnel extends the individual Ethernet segments accross the three bridges, providing a unique (virtual) Ethernet segment. From one host (e.g. H1), we can reach directly all the other hosts in the virtual segment:

$ ping -c10 -w1 -t1 ff02::1%eth0
PING ff02::1%eth0(ff02::1%eth0) 56 data bytes
64 bytes from fe80::5254:33ff:fe00:8%eth0: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from fe80::5254:33ff:fe00:b%eth0: icmp_seq=1 ttl=64 time=4.98 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:9%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:a%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
--- ff02::1%eth0 ping statistics ---
1 packets transmitted, 1 received, +3 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.016/3.745/4.991/2.152 ms

Basic usage The reference deployment for VXLAN is to use an IP multicast group to join the other VTEPs:

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   group ff05::100 \
>   dev eth0 \
>   ttl 5
# brctl addbr br100
# brctl addif br100 vxlan100
# brctl addif br100 vnet22
# brctl addif br100 vnet25
# brctl stp br100 off
# ip link set up dev br100
# ip link set up dev vxlan100

The above commands create a new interface acting as a VXLAN tunnel endpoint, named vxlan100 and put it in a bridge with some regular interfaces¹. Each VXLAN segment is associated to a 24-bit segment ID, the VXLAN Network Identifier (VNI). In our example, the default VNI is specified with id 100. When VXLAN was first implemented in Linux 3.7, the UDP port to use was not defined. Several vendors were using 8472 and Linux took the same value. To avoid breaking existing deployments, this is still the default value. Therefore, if you want to use the IANA-assigned port, you need to explicitely set it with dstport 4789. As we want to use multicast, we have to specify a multicast group to join (group ff05::100), as well as a physical device (

dev
eth0

). With multicast, the default TTL is 1. If your multicast network leverages some routing, you ll have to increase the value a bit, like here with ttl 5. The vxlan100 device acts as a bridge device with remote VTEPs as virtual ports:

it sends broadcast, unknown unicast and multicast (BUM) frames to all VTEPs using the multicast group, and
it discovers the association from Ethernet MAC addresses to VTEP IP addresses using source-address learning.

The following figure summarizes the configuration, with the FDB of the Linux bridge (learning local MAC addresses) and the FDB of the VXLAN device (learning distant MAC addresses): Bridged VXLAN device

The FDB of the VXLAN device can be observed with the bridge command. If the destination MAC is present, the frame is sent to the associated VTEP (unicast). The all-zero address is only used when a lookup for the destination MAC fails.

# bridge fdb show dev vxlan100   grep dst
00:00:00:00:00:00 dst ff05::100 via eth0 self permanent
50:54:33:00:00:0b dst 2001:db8:3::1 self
50:54:33:00:00:08 dst 2001:db8:1::1 self

If you are interested to get more details on how to setup a multicast network and build VXLAN segments on top of it, see my Network virtualization with VXLAN article.

Without multicast Using VXLAN over a multicast IP network has several benefits:

automatic discovery of other VTEPs sharing the same multicast group,

good bandwidth usage (packets are replicated as late as possible),

decentralized and controller-less design².

However, multicast is not available everywhere and managing it at scale can be difficult. In Linux 3.8, the DOVE extensions have been added to the VXLAN implementation, removing the dependency on multicast.

Unicast with static flooding We can replace multicast by head-end replication of BUM frames to a statically configured lists of remote VTEPs³:

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:3::1

The VXLAN is defined without a remote multicast group. Instead, all the remote VTEPs are associated with the all-zero address: a BUM frame will be duplicated to all those destinations. The VXLAN device will still learn remote addresses automatically using source-address learning. It is a very simple solution. With a bit of automation, you can keep the default FDB entries up-to-date easily. However, the host will have to duplicate each BUM frame (head-end replication) as many times as there are remote VTEPs. This is quite reasonable if you have a dozen of them. This may become out-of-hand if you have thousands of them. Cumulus vxfld daemon is an example of use of this strategy (in the head-end replication mode).

Unicast with static L2 entries When the associations of MAC addresses and VTEPs are known, it is possible to pre-populate the FDB and disable learning:

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   nolearning
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:3::1
# bridge fdb append 50:54:33:00:00:09 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0a dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0b dev vxlan100 dst 2001:db8:3::1

Thanks to the nolearning flag, source-address learning is disabled. Therefore, if a MAC is missing, the frame will always be sent using the all-zero entries. The all-zero entries are still needed for broadcast and multicast traffic (e.g. ARP and IPv6 neighbor discovery). This kind of setup works well to provide virtual L2 networks to virtual machines (no L3 information available). You need some glue to update the FDB entries. BGP EVPN with Cumulus Quagga is an example of use of this strategy (see VXLAN: BGP EVPN with Cumulus Quagga for additional information).

Unicast with static L3 entries In the previous example, we had to keep the all-zero entries for ARP and IPv6 neighbor discovery to work correctly. However, Linux can answer to neighbor requests on behalf of the remote nodes⁴. When this feature is enabled, the default entries are not needed anymore (but you could keep them):

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   nolearning \
>   proxy
# ip -6 neigh add 2001:db8:ff::11 lladdr 50:54:33:00:00:09 dev vxlan100
# ip -6 neigh add 2001:db8:ff::12 lladdr 50:54:33:00:00:0a dev vxlan100
# ip -6 neigh add 2001:db8:ff::13 lladdr 50:54:33:00:00:0b dev vxlan100
# bridge fdb append 50:54:33:00:00:09 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0a dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0b dev vxlan100 dst 2001:db8:3::1

This setup totally eliminates head-end replication. However, protocols relying on multicast won t work either. With some automation, this is a setup that should work well with containers: if there is a registry keeping a list of all IP and MAC addresses in use, a program could listen to it and adjust the FDB and the neighbor tables. The VXLAN backend of Docker s libnetwork is an example of use of this strategy (but it also uses the next method).

Unicast with dynamic L3 entries Linux can also notify a program an (L2 or L3) entry is missing. The program queries some central registry and dynamically adds the requested entry. However, for L2 entries, notifications are issued only if:

the destination MAC address is not known,
there is no all-zero entry in the FDB, and
the destination MAC address is not a multicast or broadcast one.

Those limitations prevent us to do a unicast with dynamic L2 entries scenario. First, let s create the VXLAN device with the l2miss and l3miss options⁵:

ip -6 link add vxlan100 type vxlan \
   id 100 \
   dstport 4789 \
   local 2001:db8:1::1 \
   nolearning \
   l2miss \
   l3miss \
   proxy

Notifications are sent to programs listening to an AF_NETLINK socket using the NETLINK_ROUTE protocol. This socket needs to be bound to the RTNLGRP_NEIGH group. The following is doing exactly that and decodes the received notifications:

# ip monitor neigh dev vxlan100
miss 2001:db8:ff::12 STALE
miss lladdr 50:54:33:00:00:0a STALE

The first notification is about a missing neighbor entry for the requested IP address. We can add it with the following command:

ip -6 neigh replace 2001:db8:ff::12 \
    lladdr 50:54:33:00:00:0a \
    dev vxlan100 \
    nud reachable

The entry is not permanent so that we don t need to delete it when it expires. If the address becomes stale, we will get another notification to refresh it. Once the host receives our proxy answer for the neighbor discovery request, it can send a frame with the MAC we gave as destination. The second notification is about the missing FDB entry for this MAC address. We add the appropriate entry with the following command⁶:

bridge fdb replace 50:54:33:00:00:0a \
    dst 2001:db8:2::1 \
    dev vxlan100 dynamic

The entry is not permanent either as it would prevent the MAC to migrate to the local VTEP (a dynamic entry cannot override a permanent entry). This setup works well with containers and a global registry. However, there is small latency penalty for the first connections. Moreover, multicast and broadcast won t be available in the underlay network. The VXLAN backend for flannel, a network fabric for Kubernetes, is an example of this strategy.

Decision There is no one-size-fits-all solution. You should consider the multicast solution if:

you are in an environment where multicast is available,

you are ready to operate (and scale) a multicast network,

you need multicast and broadcast inside the virtual segments,

you don t have L2/L3 addresses available beforehand.

The scalability of such a solution is pretty good if you take care of not putting all VXLAN interfaces into the same multicast group (e.g. use the last byte of the VNI as the last byte of the multicast group). When multicast is not available, another generic solution is BGP EVPN: BGP is used as a controller to ensure distribution of the list of VTEPs and their respective FDBs. As mentioned earlier, an implementation of this solution is Cumulus Quagga. I explore this option in a separate post: VXLAN: BGP EVPN with Cumulus Quagga. If you operate in a container-like environment where L2/L3 addresses are known beforehand, a solution using static and/or dynamic L2 and L3 entries based on a central registry and no source-address learning would also fit the bill. This provides a more security-tight solution (bound resources, MiTM attacks dampened down, inability to amplify bandwidth usage through excessive broadcast). Various environment-specific solutions are available⁷ or you can build your own.

Other considerations Independently of the chosen strategy, here are a few important points to keep in mind when implementing a VXLAN overlay.

Isolation While you may expect VXLAN interfaces to only carry L2 traffic, Linux doesn t disable IP processing. If the destination MAC is a local one, Linux will route or deliver the encapsulated IP packet. Check my post about the proper isolation of a Linux bridge.

Encryption VXLAN enforces isolation between tenants, but the traffic is totally unencrypted. The most direct solution to provide encryption is to use IPsec. Some container-based solutions may come with IPsec support out-of-the box (notably Docker s libnetwork, but flannel has plan for it too). This is quite important for a deployment over a public cloud.

Overhead The format of a VXLAN-encapsulated frame is the following: VXLAN adds a fixed overhead of 50 bytes. If you also use IPsec, the overhead depends on many factors. In transport mode, with AES and SHA256, the overhead is 56 bytes. With NAT traversal, this is 64 bytes (additional UDP header). In tunnel mode, this is 72 bytes. See Cisco IPsec Overhead Calculator Tool. Some users will expect to be able to use an Ethernet MTU of 1500 for the overlay network. Therefore, the underlay MTU should be increased. If it is not possible, ensure the inner MTU (inside the containers or the virtual machines) is correctly decreased⁸.

IPv6 While all the examples above are using IPv6, the ecosystem is not quite ready yet. The multicast L2-only strategy works fine with IPv6 but every other scenario currently needs some patches (1, 2, 3). On top of that, IPv6 may not have been implemented in VXLAN-related tools:

Cumulus Quagga and Cumulus vxfld daemon do not support IPv6 as a transport layer,

flannel has absolutely no IPv6 support,

Docker s libnetwork has no IPv6 support for the VXLAN backend.

Multicast Linux VXLAN implementation doesn t support IGMP snooping. Multicast traffic will be broadcasted to all VTEPs unless multicast MAC addresses are inserted into the FDB.

This is one possible implementation. The bridge is only needed if you require some form of source-address learning for local interfaces. Another strategy is to use MACVLAN interfaces.
The underlay multicast network may still need some central components, like rendez-vous points for PIM-SM protocol. Fortunately, it s possible to make them highly available and scalable (e.g. with Anycast-RP, RFC 4610).

For this example and the following ones, a patch is needed for the ip command (to be included in 4.11) to use IPv6 for transport. In the meantime, here is a quick workaround:

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   remote 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 \
>   dev vxlan100 dst 2001:db8:3::1

You may have to apply an IPv6-related patch to the kernel (to be included in 4.12).
You have to apply an IPv6-related patch to the kernel (to be included in 4.12) to get appropriate notifications for missing IPv6 addresses.
Directly adding the entry after the first notification would have been smarter to avoid unnecessary retransmissions.
flannel and Docker s libnetwork were already mentioned as they both feature a VXLAN backend. There are also some interesting experiments like BaGPipe BGP for Kubernetes which leverages BGP EVPN and is therefore interoperable with other vendors.
There is no such thing as MTU discovery on an Ethernet segment.

12 April 2017

Vincent Bernat: Proper isolation of a Linux bridge

TL;DR: when configuring a Linux bridge, use the following commands to enforce isolation:

# bridge vlan del dev br0 vid 1 self
# echo 1 > /sys/class/net/br0/bridge/vlan_filtering

A network bridge (also commonly called a switch ) brings several Ethernet segments together. It is a common element in most infrastructures. Linux provides its own implementation. A typical use of a Linux bridge is shown below. The hypervisor is running three virtual hosts. Each virtual host is attached to the br0 bridge (represented by the horizontal segment). The hypervisor has two physical network interfaces:

eth0 is attached to a public network providing various services for the virtual hosts (DHCP, DNS, NTP, routers to Internet, ). It is also part of the br0 bridge.
eth1 is attached to an infrastructure network providing various services to the hypervisor (DNS, NTP, configuration management, routers to Internet, ). It is not part of the br0 bridge.

Typical use of Linux bridging with virtual machines

The main expectation of such a setup is that while the virtual hosts should be able to use resources from the public network, they should not be able to access resources from the infrastructure network (including resources hosted on the hypervisor itself, like a SSH server). In other words, we expect a total isolation between the green domain and the purple one. That s not the case. From any virtual host:

# ip route add 192.168.14.3/32 dev eth0
# ping -c 3 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=59 time=0.644 ms
64 bytes from 192.168.14.3: icmp_seq=2 ttl=59 time=0.829 ms
64 bytes from 192.168.14.3: icmp_seq=3 ttl=59 time=0.894 ms
--- 192.168.14.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2033ms
rtt min/avg/max/mdev = 0.644/0.789/0.894/0.105 ms

Why? There are two main factors of this behavior:

A bridge can accept IP traffic. This is a useful feature if you want Linux to act as a bridge and provide some IP services to bridge users (a DHCP relay or a default gateway). This is usually done by configuring the IP address on the bridge device: `ip addr add 192.0.2.2/25 dev br0`.

An interface doesn t need an IP address to process incoming IP traffic. Additionally, by default, Linux accepts to answer ARP requests independently from the incoming interface.

Bridge processing After turning an incoming Ethernet frame into a socket buffer, the network driver transfers the buffer to the `netif_receive_skb()` function. The following actions are executed:

copy the frame to any registered global or per-device taps (e.g. `tcpdump`),

evaluate the ingress policy (configured with `tc`),

hand over the frame to the device-specific receive handler, if any,

hand over the frame to a global or device-specific protocol handler (e.g. IPv4, ARP, IPv6).

For a bridged interface, the kernel has configured a device-specific receive handler, `br_handle_frame()`. This function won t allow any additional processing in the context of the incoming interface, except for STP and LLDP frames or if brouting is enabled¹. Therefore, the protocol handlers are never executed in this case. After a few additional checks, Linux will decide if the frame has to be locally delivered:

the entry for the target MAC in the FDB is marked for local delivery, or

the target MAC is a broadcast or a multicast address.

In this case, the frame is passed to the `br_pass_frame_up()` function. A VLAN-related check is optionally performed. The socket buffer is attached to the bridge interface (`br0`) instead of the physical interface (`eth0`), is evaluated by Netfilter and sent back to `netif_receive_skb()`. It will go through the four steps a second time.

IPv4 processing When a device doesn t have a protocol-independent receive handler, a protocol-specific handler will be used:

# cat /proc/net/ptype
Type Device      Function
0800          ip_rcv
0011          llc_rcv [llc]
0004          llc_rcv [llc]
0806          arp_rcv
86dd          ipv6_rcv

Therefore, if the Ethernet type of the incoming frame is 0x800, the socket buffer is handled by ip_rcv(). Among other things, the three following steps will happen:

If the frame destination address is not the MAC address of the incoming interface, not a multicast one and not a broadcast one, the frame is dropped ( not for us ).
Netfilter gets a chance to evaluate the packet (in a PREROUTING chain).
The routing subsystem will decide the destination of the packet in ip_route_input_slow(): is it a local packet, should it be forwarded, should it be dropped, should it be encapsulated? Notably, the reverse-path filtering is done during this evaluation in fib_validate_source().

Reverse-path filtering (also known as uRPF, or unicast reverse-path forwarding, RFC 3704) enables Linux to reject traffic on interfaces which it should never have originated: the source address is looked up in the routing tables and if the outgoing interface is different from the current incoming one, the packet is rejected.

ARP processing When the Ethernet type of the incoming frame is `0x806`, the socket buffer is handled by `arp_rcv()`.

Like for IPv4, if the frame is not for us, it is dropped.

If the incoming device has the `NOARP` flag, the frame is dropped.

Netfilter gets a chance to evaluate the packet (configuration is done with `arptables`).

For an ARP request, the values of `arp_ignore` and `arp_filter` may trigger a drop of the packet.

IPv6 processing When the Ethernet type of the incoming frame is `0x86dd`, the socket buffer is handled by `ipv6_rcv()`.

Like for IPv4, if the frame is not for us, it is dropped.

If IPv6 is disabled on the interface, the packet is dropped.

Netfilter gets a chance to evaluate the packet (in a `PREROUTING` chain).

The routing subsystem will decide the destination of the packet. However, unlike IPv4, there is no reverse-path filtering².

Workarounds There are various methods to fix the situation. We can completely ignore the bridged interfaces: as long as they are attached to the bridge, they cannot process any upper layer protocol (IPv4, IPv6, ARP). Therefore, we can focus on filtering incoming traffic from `br0`. It should be noted that for IPv4, IPv6 and ARP protocols, the MAC address check can be circumvented by using the broadcast MAC address.

Protocol-independent workarounds The four following fixes will indistinctly drop IPv4, ARP and IPv6 packets.

Using VLAN-aware bridge Linux 3.9 introduced the ability to use VLAN filtering on bridge ports. This can be used to prevent any local traffic:

# echo 1 > /sys/class/net/br0/bridge/vlan_filtering
# bridge vlan del dev br0 vid 1 self
# bridge vlan show
port    vlan ids
eth0     1 PVID Egress Untagged
eth2     1 PVID Egress Untagged
eth3     1 PVID Egress Untagged
eth4     1 PVID Egress Untagged
br0     None

This is the most efficient method since the frame is dropped directly in br_pass_frame_up().

Using ingress policy It s also possible to drop the bridged frame early after it has been re-delivered to netif_receive_skb() by br_pass_frame_up(). The ingress policy of an interface is evaluated before any handler. Therefore, the following commands will ensure no local delivery (the source interface of the packet is the bridge interface) happens:

# tc qdisc add dev br0 handle ffff: ingress
# tc filter add dev br0 parent ffff: u32 match u8 0 0 action drop

In my opinion, this is the second most efficient method.

Using ebtables Just before re-delivering the frame to netif_receive_skb(), Netfilter gets a chance to issue a decision. It s easy to configure it to drop the frame:

# ebtables -A INPUT --logical-in br0 -j DROP

However, to the best of my knowledge, this part of Netfilter is known to be inefficient.

Using namespaces Isolation can also be obtained by moving all the bridged interfaces into a dedicated network namespace and configure the bridge inside this namespace:

# ip netns add bridge0
# ip link set netns bridge0 eth0
# ip link set netns bridge0 eth2
# ip link set netns bridge0 eth3
# ip link set netns bridge0 eth4
# ip link del dev br0
# ip netns exec bridge0 brctl addbr br0
# for i in 0 2 3 4; do
>    ip netns exec bridge0 brctl addif br0 eth$i
>    ip netns exec bridge0 ip link set up dev eth$i
> done
# ip netns exec bridge0 ip link set up dev br0

The frame will still wander a bit inside the IP stack, wasting some CPU cycles and increasing the possible attack surface. But ultimately, it will be dropped.

Protocol-dependent workarounds Unless you require multiple layers of security, if one of the previous workarounds is already applied, there is no need to apply one of the protocol-dependent fix below. It s still interesting to know them because it is not uncommon to already have them in place.

ARP The easiest way to disable ARP processing on a bridge is to set the NOARP flag on the device. The ARP packet will be dropped as the very first step of the ARP handler.

# ip link set arp off dev br0
# ip l l dev br0
8: br0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 50:54:33:00:00:04 brd ff:ff:ff:ff:ff:ff

arptables can also drop the packet quite early:

# arptables -A INPUT -i br0 -j DROP

Another way is to set arp_ignore to 2 for the given interface. The kernel will only answer to ARP requests whose target IP address is configured on the incoming interface. Since the bridge interface doesn t have any IP address, no ARP requests will be answered.

# sysctl -qw net.ipv4.conf.br0.arp_ignore=2

Disabling ARP processing is not a sufficient workaround for IPv4. A user can still insert the appropriate entry in its neighbor cache:

# ip neigh replace 192.168.14.3 lladdr 50:54:33:00:00:04 dev eth0
# ping -c 1 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=49 time=1.30 ms
--- 192.168.14.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.309/1.309/1.309/0.000 ms

As the check on the target MAC address is quite loose, they don t even need to guess the MAC address:

# ip neigh replace 192.168.14.3 lladdr ff:ff:ff:ff:ff:ff dev eth0
# ping -c 1 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=49 time=1.12 ms
--- 192.168.14.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.129/1.129/1.129/0.000 ms

IPv4 The earliest place to drop an IPv4 packet is with Netfilter³:

# iptables -t raw -I PREROUTING -i br0 -j DROP

If Netfilter is disabled, another possibility is to enable strict reverse-path filtering for the interface. In this case, since there is no IP address configured on the interface, the packet will be dropped during the route lookup:

# sysctl -qw net.ipv4.conf.br0.rp_filter=1

Another option is the use of a dedicated routing rule. Compared to the reverse-path filtering option, the packet will be dropped a bit earlier, still during the route lookup.

# ip rule add iif br0 blackhole

IPv6 Linux provides a way to completely disable IPv6 on a given interface. The packet will be dropped as the very first step of the IPv6 handler:

# sysctl -qw net.ipv6.conf.br0.disable_ipv6=1

Like for IPv4, it s possible to use Netfilter or a dedicated routing rule.

About the example In the above example, the virtual host get ICMP replies because they are routed through the infrastructure network to Internet (e.g. the hypervisor has a default gateway which also acts as a NAT router to Internet). This may not be the case. If you want to check if you are vulnerable despite not getting an ICMP reply, look at the guest neighbor table to check if you got an ARP reply from the host:

# ip route add 192.168.14.3/32 dev eth0
# ip neigh show dev eth0
192.168.14.3 lladdr 50:54:33:00:00:04 REACHABLE

If you didn t get a reply, you could still have issues with IP processing. Add a static neighbor entry before checking the next step:

# ip neigh replace 192.168.14.3 lladdr ff:ff:ff:ff:ff:ff dev eth0

To check if IP processing is enabled, check the bridge host s network statistics:

# netstat -s   grep "ICMP messages"
    15 ICMP messages received
    15 ICMP messages sent
    0 ICMP messages failed

If the counters are increasing, it is processing incoming IP packets. One-way communication still allows a lot of bad things, like DoS attacks. Additionally, if the hypervisor happens to also act as a router, the reach is extended to the whole infrastructure network, potentially exposing weak devices (e.g. PDU) exposing an SNMP agent. If one-way communication is all that s needed, the attacker can also spoof its source IP address, bypassing IP-based authentication.

A frame can be forcibly routed (L3) instead of bridged (L2) by brouting the packet. This action can be triggered using ebtables.
For IPv6, reverse-path filtering needs to be implemented with Netfilter, using the rpfilter match.
If the br_netfilter module is loaded, net.bridge.bridge-nf-call-ipatbles sysctl has to be set to 0. Otherwise, you also need to use the physdev match to not drop IPv4 packets going through the bridge.

5 March 2017

Vincent Bernat: Netops with Emacs and Org mode

Org mode is a package for Emacs to keep notes, maintain todo lists, planning projects and authoring documents . It can execute embedded snippets of code and capture the output (through Babel). It s an invaluable tool for documenting your infrastructure and your operations. Here are three (relatively) short videos exhibiting Org mode use in the context of network operations. In all of them, I am using my own junos-mode which features the following perks:

syntax highlighting for configuration files,
commit of configuration snippets to remote devices, and
execution of remote commands.

Since some Junos devices can be quite slow, commits and remote executions are done asynchronously¹ with the help of a Python helper. In the first video, I take some notes about configuring BGP add-path feature (RFC 7911). It demonstrates all the available features of junos-mode.

In the second video, I execute a planned operation to enable this feature in production. The document is a modus operandi and contains the configuration to apply and the commands to check if it works as expected. At the end, the document becomes a detailed report of the operation.

In the third video, a cookbook has been prepared to execute some changes. I set some variables and execute the cookbook to apply the change and check the result.

This is a bit of a hack since Babel doesn t have native support for that. Also have a look at ob-async which is a language-independent implementation of the same idea.

9 February 2017

Vincent Bernat: Integration of a Go service with systemd

Unlike other programming languages, Go s runtime doesn t provide a way to reliably daemonize a service. A system daemon has to supply this functionality. Most distributions ship systemd which would fit the bill. A correct integration with systemd is quite straightforward. There are two interesting aspects: readiness & liveness. As an example, we will daemonize this service whose goal is to answer requests with nifty 404 errors:

package main
import (
    "log"
    "net"
    "net/http"
)
func main()  
    l, err := net.Listen("tcp", ":8081")
    if err != nil  
        log.Panicf("cannot listen: %s", err)
     
    http.Serve(l, nil)

You can build it with go build 404.go. Here is the service file, 404.service¹:

[Unit]
Description=404 micro-service
[Service]
Type=notify
ExecStart=/usr/bin/404
WatchdogSec=30s
Restart=on-failure
[Install]
WantedBy=multi-user.target

Readiness The classic way for an Unix daemon to signal its readiness is to daemonize. Technically, this is done by calling fork(2) twice (which also serves other intents). This is a very common task and the BSD systems, as well as some other C libraries, supply a daemon(3) function for this purpose. Services are expected to daemonize only when they are ready (after reading configuration files and setting up a listening socket, for example). Then, a system can reliably initialize its services with a simple linear script:

syslogd
unbound
ntpd -s

Each daemon can rely on the previous one being ready to do its work. The sequence of actions is the following:

syslogd reads its configuration, activates /dev/log, daemonizes.
unbound reads its configuration, listens on 127.0.0.1:53, daemonizes.
ntpd reads its configuration, connects to NTP peers, waits for clock to be synchronized², daemonizes.

With systemd, we would use Type=fork in the service file. However, Go s runtime does not support that. Instead, we use Type=notify. In this case, systemd expects the daemon to signal its readiness with a message to an Unix socket. go-systemd package handles the details for us:

package main
import (
    "log"
    "net"
    "net/http"
    "github.com/coreos/go-systemd/daemon"
)
func main()  
    l, err := net.Listen("tcp", ":8081")
    if err != nil  
        log.Panicf("cannot listen: %s", err)
     
    daemon.SdNotify(false, "READY=1") //  
    http.Serve(l, nil)                //

It s important to place the notification after net.Listen() (in ): if the notification was sent earlier, a client would get connection refused when trying to use the service. When a daemon listens to a socket, connections are queued by the kernel until the daemon is able to accept them (in ). If the service is not run through systemd, the added line is a no-op.

Liveness Another interesting feature of systemd is to watch the service and restart it if it happens to crash (thanks to the Restart=on-failure directive). It s also possible to use a watchdog: the service sends watchdog keep-alives at regular interval. If it fails to do so, systemd will restart it. We could insert the following code just before http.Serve() call:

go func()  
    interval, err := daemon.SdWatchdogEnabled(false)
    if err != nil   interval == 0  
        return
     
    for  
        daemon.SdNotify(false, "WATCHDOG=1")
        time.Sleep(interval / 3)
     
 ()

However, this doesn t add much value: the goroutine is unrelated to the core business of the service. If for some reason, the HTTP part gets stuck, the goroutine will happily continue to send keep-alives to systemd. In our example, we can just do a HTTP query before sending the keep-alive. The internal loop can be replaced with this code:

for  
    _, err := http.Get("http://127.0.0.1:8081") //  
    if err == nil  
        daemon.SdNotify(false, "WATCHDOG=1")
     
    time.Sleep(interval / 3)

In , we connect to the service to check if it s still working. If we get some kind of answer, we send a watchdog keep-alive. If the service is unavailable or if http.Get() gets stuck, systemd will trigger a restart. There is no universal recipe. However, checks can be split into two groups:

Before sending a keep-alive, you execute an active check on the components of your service. The keep-alive is sent only if all checks are successful. The checks can be internal (like in the above example) or external (for example, check with a query to the database).
Each component reports its status, telling if it s alive or not. Before sending a keep-alive, you check the reported status of all components (passive check). If some components are late or reported fatal errors, don t send the keep-alive.

If possible, recovery from errors (for example, with a backoff retry) and self-healing (for example, by reestablishing a network connection) is always better, but the watchdog is a good tool to handle the worst cases and avoid too complex recovery logic. For example, if a component doesn t know how to recover from an exceptional condition³, instead of using panic(), it could signal its situation before dying. Another dedicated component could try to resolve the situation by restarting the faulty component. If it fails to reach an healthy state in time, the watchdog timer will trigger and the whole service will be restarted.

Depending on the distribution, this should be installed in /lib/systemd/system or /usr/lib/systemd/system. Check with the output of the command pkg-config systemd --variable=systemdsystemunitdir.
This highly depends on the NTP daemon used. OpenNTPD doesn t wait unless you use the -s option. ISC NTP doesn t either unless you use the --wait-sync option.
An example of an exceptional condition is to reach the limit on the number of file descriptors. Self-healing from this situation is difficult and it s easy to get stuck in a loop.

Next.

Previous.

Search Results: "bernat"

19 September 2020

2 September 2020

Code The module has the following signature and it installs the specified key for each user and remove keys from retired users the ones we do not specify. iosxr_users: keys: vincent: ssh-rsa AAAAB3NzaC1yc2EAA[ ]ymh+YrVWLZMJR cedric: ssh-rsa AAAAB3NzaC1yc2EAA[ ]RShPA8w/8eC0n

Module definition Starting from the skeleton described in the previous article, we define the module: module_args = dict( keys=dict(type='dict', elements='str', required=True), ) module = AnsibleModule( argument_spec=module_args, supports_check_mode=True ) result = dict( changed=False )

23 August 2020

18 July 2020

Resources Source tarball, Feodra/SUSE RPM Packages available at project s release page Debian packages will be available soon in Unstable. Homepage: https://github.com/rickysarraf/laptop-mode-tools/wiki Mailing List: https://groups.google.com/d/forum/laptop-mode-tools

5 April 2020

9 October 2017

22 September 2017

13 September 2017

12 September 2017

20 August 2017

2 August 2017

3 July 2017

21 June 2017

3 May 2017

Verification Step by step, let s check how everything comes together.

Other considerations Independently of the chosen strategy, here are a few important points to keep in mind when implementing a VXLAN overlay.

Isolation While you may expect VXLAN interfaces to only carry L2 traffic, Linux doesn t disable IP processing. If the destination MAC is a local one, Linux will route or deliver the encapsulated IP packet. Check my post about the proper isolation of a Linux bridge.

12 April 2017

Protocol-independent workarounds The four following fixes will indistinctly drop IPv4, ARP and IPv6 packets.

Using ebtables Just before re-delivering the frame to netif_receive_skb(), Netfilter gets a chance to issue a decision. It s easy to configure it to drop the frame: # ebtables -A INPUT --logical-in br0 -j DROP However, to the best of my knowledge, this part of Netfilter is known to be inefficient.

Protocol-dependent workarounds Unless you require multiple layers of security, if one of the previous workarounds is already applied, there is no need to apply one of the protocol-dependent fix below. It s still interesting to know them because it is not uncommon to already have them in place.

IPv6 Linux provides a way to completely disable IPv6 on a given interface. The packet will be dropped as the very first step of the IPv6 handler: # sysctl -qw net.ipv6.conf.br0.disable_ipv6=1 Like for IPv4, it s possible to use Netfilter or a dedicated routing rule.

5 March 2017

9 February 2017

Code The module has the following signature and it installs the specified key for each user and remove keys from retired users the ones we do not specify.
iosxr_users: keys: vincent: ssh-rsa AAAAB3NzaC1yc2EAA[ ]ymh+YrVWLZMJR cedric: ssh-rsa AAAAB3NzaC1yc2EAA[ ]RShPA8w/8eC0n

Module definition Starting from the skeleton described in the previous article, we define the module:
module_args = dict( keys=dict(type='dict', elements='str', required=True), ) module = AnsibleModule( argument_spec=module_args, supports_check_mode=True ) result = dict( changed=False )

Resources

Source tarball, Feodra/SUSE RPM Packages available at project s release page

Debian packages will be available soon in Unstable.

Homepage: https://github.com/rickysarraf/laptop-mode-tools/wiki

Mailing List: https://groups.google.com/d/forum/laptop-mode-tools

IPv6 Linux provides a way to completely disable IPv6 on a given interface. The packet will be dropped as the very first step of the IPv6 handler:
# sysctl -qw net.ipv6.conf.br0.disable_ipv6=1
Like for IPv4, it s possible to use Netfilter or a dedicated routing rule.